CS 57300: Assignment 3


Comparing Methods for Speed Dating Classification

In this programming assignment, you will implement Logistic Regression and Linear SVM for the classification task that you explored in Assignment 2, and then compare the performance of the different classifiers. As in Assignment 2, you must design and implement your own versions of the algorithms in Python. DO NOT use any publicly available code, including libraries such as sklearn; your code will be checked against public implementations. In addition, we will NOT provide separate testing data. You are asked to design your own tests to ensure that your code runs correctly and meets the specifications below.

Note: You may use the pandas, numpy, and scipy libraries for data processing purposes. The only restriction is that you have to write your own version of the data mining algorithms; you cannot use any built-in functions for your algorithms. This is a general rule for this assignment and all the upcoming ones as well.

As before, you should submit your typed assignment report as a pdf along with your source code file. (Please don't forget to put your evaluation and analysis in .pdf format, and make sure to also include screenshots of the results and/or the output of the problems.) In the following sections, we specify a number of steps you are asked to complete for this assignment. Note that all results in sample outputs are fictitious and for representation only.

1 Preprocessing (4 pts)

Consider the data file dating-full.csv that you used in Assignment 2. For this assignment, we will only consider the first 6500 speed dating events in this file; that is, you can discard the last 244 lines of the file. Write a Python script named preprocess-assg3.py which reads the first 6500 speed dating events in dating-full.csv as input and performs the following operations.

(i) Repeat the preprocessing steps 1(i), 1(ii) and 1(iv) that you did in Assignment 2.
(You can reuse the code there; you are not required to print any outputs.)

(ii) For the categorical attributes gender, race, race_o and field, apply one-hot encoding. Sort the values of each categorical attribute lexicographically/alphabetically before you start the encoding process, and set the last value of that attribute as the reference (i.e., the last value of that attribute will be mapped to a vector of all zeros). You are then asked to print as outputs the mapped vectors for 'female' in the gender column, for 'Black/African American' in the race column, for 'Other' in the race_o column, and for 'economics' in the field column.

• Expected output lines:
Mapped vector for female in column gender: [vector-for-female].
Mapped vector for Black/African American in column race: [vector-for-Black/African American].
Mapped vector for Other in column race_o: [vector-for-Other].
Mapped vector for economics in column field: [vector-for-economics].

Additional note on one-hot encoding: The key point is that you will transform a categorical variable with n unique values into n − 1 binary variables (you can think of this as mapping one value to a vector of binary values). The vector we ask you to print is precisely the sequence of n − 1 binary variable values that correspond to the specified value of a categorical variable. You don't need to do anything with the reference vector. It is just a way for us to make sure that all students use the same categorical variable value as the reference, so the encoding result will be the same across all students.

(iii) Use the sample function from pandas with the parameters initialized as random_state = 25, frac = 0.2 to take a random 20% sample from the entire dataset. This sample will serve as your test dataset, which you should output in testSet.csv; the rest will be your training dataset, which you should output in trainingSet.csv.
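A minimal sketch of the one-hot encoding and train/test split described above, using pandas. The toy frame, column names, and the encoding helper are illustrative stand-ins for the real dating-full.csv pipeline; only the sorting/reference convention and the sample parameters come from the spec:

```python
import pandas as pd

def one_hot_encode(df, cols):
    """Encode each categorical column as n-1 binary columns, using the
    lexicographically last value as the all-zeros reference."""
    for col in cols:
        values = sorted(df[col].unique())    # sort lexicographically
        for v in values[:-1]:                # last value becomes the reference
            df[col + "_" + str(v)] = (df[col] == v).astype(int)
        df = df.drop(columns=[col])
    return df

# toy frame standing in for the real dataset
df = pd.DataFrame({"gender": ["female", "male", "female", "male"],
                   "decision": [1, 0, 1, 0]})
df = one_hot_encode(df, ["gender"])
# 'female' sorts before 'male', so 'male' is the reference: female -> [1]
print(df.columns.tolist())  # ['decision', 'gender_female']

# reproducible 20% test split, as the spec requires
test = df.sample(random_state=25, frac=0.2)
train = df.drop(test.index)
# then write them out, e.g.:
# train.to_csv("trainingSet.csv", index=False)
# test.to_csv("testSet.csv", index=False)
```

With n = 2 unique gender values, the encoding yields a single binary column, matching the sample output `[1]` for 'female'.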
(Note: The use of random_state will ensure all students have the same training and test datasets; incorrect or no initialization of this parameter will lead to non-reproducible results.)

In summary, below are the sample inputs and outputs we expect to see. We expect 4 lines of output (the outputs below are fictitious) as well as 2 new .csv files (trainingSet.csv and testSet.csv) to be produced:

$ python preprocess-assg3.py
Mapped vector for female in column gender: [1]
Mapped vector for Black/African American in column race: [0 0 1 0 0]
Mapped vector for Other in column race_o: [0 0 0 1 0]
Mapped vector for economics in column field: [0 0 0 0 0 0 0 0]

2 Implement Logistic Regression and Linear SVM (16 pts)

Please put your code for this question in a file called lr_svm.py. This script should take three command-line arguments as input, as described below:

1. trainingDataFilename: the set of data that will be used to train your algorithms (e.g., trainingSet.csv).
2. testDataFilename: the set of data that will be used to test your algorithms (e.g., testSet.csv).
3. modelIdx: an integer to specify the model to use for classification (LR = 1 and SVM = 2).

Note: Please refer to the lecture slides on Brightspace for the pseudocode of these algorithms rather than referring to other online sources. Also, when implementing gradient descent, DO NOT implement stochastic gradient descent, and make sure to follow the values given for the parameters to be used by each algorithm as described below:

(i) Write a function named lr(trainingSet, testSet) which takes the training dataset and the testing dataset as input parameters. The purpose of this function is to train a logistic regression classifier using the data in the training dataset, and then test the classifier's performance on the testing dataset. Use the following setup for training the logistic regression classifier: (1) Use L2 regularization, with λ = 0.01.
Optimize with gradient descent, using an initial weight vector of all zeros and a step size of 0.01. (2) Stop optimization after a maximum number of iterations max = 500, or when the L2 norm of the difference between new and old weights is smaller than the threshold tol = 1e−6, whichever is reached first. Print the classifier's accuracy on both the training dataset and the testing dataset (rounded to two decimals).

(ii) Write a function named svm(trainingSet, testSet) which takes the training dataset and the testing dataset as input parameters. The purpose of this function is to train a linear SVM classifier using the data in the training dataset, and then test the classifier's performance on the testing dataset. Use the following setup for training the SVM: (1) Use hinge loss. Optimize with subgradient descent, using an initial weight vector of all zeros, a step size of 0.5 and a regularization parameter of λ = 0.01. (2) Stop optimization after a maximum number of iterations max = 500, or when the L2 norm of the difference between new and old weights is smaller than the threshold tol = 1e−6, whichever is reached first. Print the classifier's accuracy on both the training dataset and the testing dataset (rounded to two decimals).

The sample inputs and outputs we expect to see are as follows (the numbers are fictitious):

$ python lr_svm.py trainingSet.csv testSet.csv 1
Training Accuracy LR: 0.71
Testing Accuracy LR: 0.68

$ python lr_svm.py trainingSet.csv testSet.csv 2
Training Accuracy SVM: 0.75
Testing Accuracy SVM: 0.74

3 Learning Curves and Performance Comparison (10 pts)

In this part, you are asked to use incremental 10-fold cross validation to plot learning curves for the different classifiers (NBC, LR, SVM), with training sets of varying size but a constant test set size. You are then asked to compare the performance of the different classifiers given the learning curves.
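Before moving on, the two Section 2 training loops can be sketched as follows. This is a minimal sketch, not the reference implementation: the exact gradient scaling (summed vs. averaged loss, whether the intercept is regularized) should follow the pseudocode on the lecture slides, and the function names and toy data are illustrative:

```python
import numpy as np

def train_lr(X, y, lam=0.01, step=0.01, max_iter=500, tol=1e-6):
    """Batch gradient descent for L2-regularized logistic regression.
    X: (n, d) features (include a column of ones for the intercept);
    y: (n,) labels in {0, 1}."""
    w = np.zeros(X.shape[1])
    for _ in range(max_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ w)))        # sigmoid predictions
        grad = X.T @ (p - y) + lam * w            # loss gradient + L2 term
        w_new = w - step * grad
        if np.linalg.norm(w_new - w) < tol:       # stop on tiny weight change
            return w_new
        w = w_new
    return w

def train_svm(X, y01, lam=0.01, step=0.5, max_iter=500, tol=1e-6):
    """Subgradient descent for L2-regularized hinge loss; labels -> {-1, +1}."""
    y = 2 * y01 - 1
    n = X.shape[0]
    w = np.zeros(X.shape[1])
    for _ in range(max_iter):
        viol = y * (X @ w) < 1                    # margin-violating examples
        sub = lam * w - (X[viol].T @ y[viol]) / n # hinge subgradient (averaged)
        w_new = w - step * sub
        if np.linalg.norm(w_new - w) < tol:
            return w_new
        w = w_new
    return w

def accuracy(X, y01, w):
    return float(np.mean((X @ w > 0).astype(int) == y01))

# tiny linearly separable toy set: intercept column + one feature
X = np.array([[1.0, 2.0], [1.0, 3.0], [1.0, -2.0], [1.0, -3.0]])
y = np.array([1, 1, 0, 0])
print(accuracy(X, y, train_lr(X, y)), accuracy(X, y, train_svm(X, y)))
```

Both loops use the handout's parameters (λ = 0.01, steps 0.01 and 0.5, max = 500, tol = 1e−6); batch gradients are used throughout, never stochastic updates.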
To prepare an NBC classifier for this part, please follow the steps below:

• Prepare a preprocessed dataset for NBC using the first 6500 rows of the file dating-full.csv. Conduct preprocessing & discretization as you did in Assignment 2 (you can reuse your code from Assignment 2). You can use a bin size of 5. The two preprocessed datasets (i.e., the one for NBC and the one for LR & SVM) will be identical in every way except that one of them uses discretization/label encoding while the other uses one-hot encoding on some attributes.
• Use the sample function from pandas with the parameters initialized as random_state = 25, frac = 0.2 to take a random 20% sample from the entire preprocessed dataset. This sample will serve as your test dataset, which you should output in testSet_NBC.csv; the rest will be your training dataset, which you should output in trainingSet_NBC.csv.

The only datasets you should use for conducting the incremental 10-fold cross validation are trainingSet.csv and trainingSet_NBC.csv. In the following, we use "training data" to refer to both of them, but you should use the data in trainingSet.csv when training LR & SVM, and the data in trainingSet_NBC.csv when training NBC. Put your code for this question in a file named cv.py.

(i) Use the sample function from pandas with the parameters initialized as random_state = 18, frac = 1 to shuffle the training data. Then partition the training data into 10 disjoint sets S = [S1, ..., S10], where S1 contains training samples with index from 1 to 520 (i.e., the first 520 lines of training samples after shuffling), S2 contains samples with index from 521 to 1040 (i.e., the second 520 lines of training samples after shuffling), and so on. Each set has 520 examples.

(ii) For each t_frac ∈ {0.025, 0.05, 0.075, 0.1, 0.15, 0.2}:
(a) For idx = [1..10]:
i. Let test_set = S_idx.
ii. Let SC = ∪ S_i for i = [1..10], i ≠ idx.
iii. Construct train_set by taking a random t_frac fraction of training examples from SC.
Use the sample function from pandas with the parameters initialized as random_state = 32, frac = t_frac to generate this training set.
iv. Learn each model (i.e., NBC, LR, SVM) from train_set. (Please use your own NBC implementation from Assignment 2.)
v. Apply each of the learned models to test_set and measure the model's accuracy. (Please remember that for NBC the test set will be generated from the file trainingSet_NBC.csv.)
(b) For each model (i.e., NBC, LR, SVM), compute the average accuracy over the ten trials and its standard error. The standard error is the standard deviation divided by the square root of the number of trials (in our case, 10). For example, for the sequence of numbers L = [0.16, 0.18, 0.19, 0.15, 0.19, 0.21, 0.21, 0.16, 0.18, 0.16], the standard deviation of L is σL = 0.021, and the standard error is sterrL = σL / sqrt(num_trials) = 0.007.

(iii) Plot the learning curves for each of the three models in the same plot, based on the incremental 10-fold cross validation results you have obtained above. Use the x-axis to represent the size of the training set (i.e., t_frac ∗ |SC|) and the y-axis to represent the model accuracy. Use error bars on the learning curves to indicate ±1 standard error.

(iv) Formulate a hypothesis about the performance difference between at least two of the models (any pair of the 3 models can be used to form your hypothesis).

(v) Test your hypothesis and discuss whether the observed data support it (i.e., whether the observed differences are significant).

Submission Instructions: Submit through Brightspace.

1. Include in your report which version of Python you are using. (There is no restriction on the Python version other than that it should be Python 3. Generally, a Python version over 3.6 is acceptable.)
2. Make sure you include in your report all the output / results you get from running your code for all sub-questions. You may include screenshots to show them.
3. Make a directory named yourFirstName_yourLastName_HW3 and copy all of your files to this directory.
4. DO NOT put the datasets into your directory.
5. Make sure you compress your directory into a zip folder with the same name as described above, and then upload your zip folder to Brightspace. (Multiple submissions are allowed and the latest submission will be graded.)

Your submission should include the following files:

1. The source code in Python.
2. Your evaluation & analysis in .pdf format. Note that your analysis should include visualization plots as well as a discussion of results, as described in detail in the questions above.
3. A README file containing your name, instructions to run your code and anything you would like us to know about your program (like errors, special conditions, etc.).
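As a reference for the bookkeeping in Section 3 (not the full experiment), the fold partition and the standard-error computation can be sketched as follows. The 100-row toy frame stands in for trainingSet.csv (which has 5200 rows, giving 520-row folds), and the model-training step is left as a placeholder comment:

```python
import numpy as np
import pandas as pd

# toy frame standing in for trainingSet.csv
rng = np.random.default_rng(0)
data = pd.DataFrame(rng.normal(size=(100, 3)), columns=["a", "b", "decision"])

# shuffle once, then cut into 10 equal, disjoint folds S1..S10
data = data.sample(random_state=18, frac=1)
fold_size = len(data) // 10                     # 520 on the real data
folds = [data.iloc[i * fold_size:(i + 1) * fold_size] for i in range(10)]

t_frac = 0.1
for idx in range(10):
    test_set = folds[idx]                              # S_idx
    sc = pd.concat(folds[:idx] + folds[idx + 1:])      # union of the other folds
    train_set = sc.sample(random_state=32, frac=t_frac)
    # ... train NBC / LR / SVM on train_set, evaluate on test_set ...

# standard error = standard deviation / sqrt(number of trials)
L = [0.16, 0.18, 0.19, 0.15, 0.19, 0.21, 0.21, 0.16, 0.18, 0.16]
sigma = np.std(L, ddof=1)          # sample std; ddof=1 reproduces the handout's 0.021
stderr = sigma / np.sqrt(len(L))
print(round(sigma, 3), round(stderr, 3))  # 0.021 0.007
```

Note that reproducing σL = 0.021 from the handout's example requires the sample standard deviation (ddof=1); numpy's default (ddof=0) gives 0.020.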