CS 422/622 Project 1

1 Python Data Storage Methods (15 Points)
File name: data_storage.py
Implement a function in Python:
build_nparray(data)
that takes a 2D array of string values. Convert the values populating the feature vectors into a 2D NumPy array of data type float, and the label data points into their own 1D NumPy array of data type int. The header values should be skipped. The function should return these two arrays as the training feature data and the training label data. For further reading on NumPy arrays, see
https://www.geeksforgeeks.org/basics-of-numpy-arrays/.
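A rough sketch of what build_nparray might look like, assuming the header is the first row and the label sits in the last column (confirm this layout against the provided CSV files):

```python
import numpy as np

def build_nparray(data):
    # Skip the header row; assume the label sits in the last column
    # (layout assumption -- adjust to match the actual CSV files).
    rows = data[1:]
    X = np.array([row[:-1] for row in rows], dtype=float)
    Y = np.array([int(row[-1]) for row in rows])
    return X, Y
```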
Implement a function in Python:
build_list(data)
that takes a 2D array of string values. Convert the feature-vector string values into a 2D list of lists with data type float, and the label data values into a 1D list of data type int. Return both lists as the training feature
data and the training label data. For further reading on lists, see https://www.geeksforgeeks.org/python-lists/.
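Under the same layout assumption (header first, label last), build_list is a pair of comprehensions:

```python
def build_list(data):
    # Skip the header row; assume the label sits in the last column.
    rows = data[1:]
    X = [[float(v) for v in row[:-1]] for row in rows]
    Y = [int(row[-1]) for row in rows]
    return X, Y
```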
Implement a function in Python:
build_dict(data)
that takes a 2D array of string values. The feature values should be converted to data type float, and
each data point in the sample (each row of the 2D array) should be stored as key-value pairs where the key
is the corresponding feature name from the CSV file header. For example, a single entry in the dictionary would read
like {feature_1: 1}. The dictionary for the labels should use each label's index position in
the array as its key. For further reading on dictionaries, see https://www.geeksforgeeks.org/python-dictionary/.
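One way to sketch build_dict, again assuming the header is the first row and the label is the last column:

```python
def build_dict(data):
    # Header row supplies the feature-name keys; assume the label
    # sits in the last column.
    header, rows = data[0], data[1:]
    X = [{header[j]: float(row[j]) for j in range(len(row) - 1)}
         for row in rows]
    # Label dictionary keyed by each sample's index position.
    Y = {i: int(row[-1]) for i, row in enumerate(rows)}
    return X, Y
```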
2 Decision Trees (50 Points)
File name: decision_trees.py
Implement a function in Python:
DT_train_binary(X,Y,max_depth)
that takes training data as input. The labels and the features are binary, but the feature vectors can be of
any finite dimension. The training feature data (X) can be structured as a 2D numpy array, with each row
corresponding to a single sample. The training labels (Y) can be structured as a 1D numpy array, with each
element corresponding to a single label. Y should have the same number of elements as X has rows. max_depth is
an integer that indicates the maximum depth for the resulting decision tree. DT_train_binary(X,Y,max_depth)
should return the decision tree generated using information gain, limited by some given maximum depth. If
max_depth is set to -1 then learning only stops when we run out of features or our information gain is 0. You
may store a decision tree however you would like, e.g., a list of lists, a class, a dictionary, etc. Bulleted lists are
okay! Binary data for testing can be found in data_1.csv and cat_dog_data.csv.
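A minimal sketch of the information-gain computation at the heart of DT_train_binary (the names entropy and information_gain are illustrative, not required by the assignment):

```python
import numpy as np

def entropy(Y):
    # Shannon entropy of a binary label array.
    if len(Y) == 0:
        return 0.0
    p = np.mean(Y)
    if p == 0 or p == 1:
        return 0.0
    return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

def information_gain(X, Y, feature):
    # Gain from asking "is this binary feature 0 or 1?".
    left, right = Y[X[:, feature] == 0], Y[X[:, feature] == 1]
    weighted = (len(left) * entropy(left)
                + len(right) * entropy(right)) / len(Y)
    return entropy(Y) - weighted
```

Training then repeats this step: pick the feature with the highest gain, split the data, and recurse until the depth limit is hit, the gain is 0, or no features remain.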
Implement a function in Python:
DT_test_binary(X,Y,DT)
that takes test data X and test labels Y and a learned decision tree model DT, and returns the accuracy (from 0
to 1) on the test data using the decision tree for predictions.
Implement a function in Python:
DT_make_prediction(x,DT)
This function should take a single sample and a trained decision tree and return a single classification. The
output should be a scalar value.
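As one possible pairing of these two functions, suppose the tree is a nested dictionary where internal nodes store a feature index under "feature" and subtrees under keys 0 and 1, with integer labels as leaves (one representation among the many the assignment allows):

```python
import numpy as np

def DT_make_prediction(x, DT):
    # Descend the tree until we reach a leaf (an int label).
    while isinstance(DT, dict):
        DT = DT[x[DT["feature"]]]
    return DT

def DT_test_binary(X, Y, DT):
    # Accuracy: fraction of samples whose prediction matches the label.
    predictions = np.array([DT_make_prediction(x, DT) for x in X])
    return float(np.mean(predictions == Y))
```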
622: Implement the functions in Python:
DT_train_real(X,Y,max_depth)
DT_test_real(X,Y,DT)
These functions are defined similarly to those above, except that the features are now real-valued. The labels are still binary. Your decision tree will need to use questions with inequalities: >, ≥, <, ≤. Real-valued
data for testing is provided in data_2.csv.
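For the real-valued version, one common approach (a sketch, not the only valid one) is to test midpoints between consecutive sorted feature values as "x ≤ t?" questions and keep the threshold with the highest gain:

```python
import numpy as np

def entropy(Y):
    # Shannon entropy of a binary label array.
    if len(Y) == 0:
        return 0.0
    p = np.mean(Y)
    if p == 0 or p == 1:
        return 0.0
    return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

def best_threshold(X, Y, feature):
    # Candidate questions are "x[feature] <= t?" for t at the midpoints
    # between consecutive sorted unique values of that feature.
    values = np.sort(np.unique(X[:, feature]))
    best_gain, best_t = 0.0, None
    for t in (values[:-1] + values[1:]) / 2:
        left, right = Y[X[:, feature] <= t], Y[X[:, feature] > t]
        weighted = (len(left) * entropy(left)
                    + len(right) * entropy(right)) / len(Y)
        gain = entropy(Y) - weighted
        if gain > best_gain:
            best_gain, best_t = gain, t
    return best_t, best_gain
```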
Write-up: Did you implement your decision tree functions iteratively or recursively? Which data structure
did you choose, and were you happy with that choice? If you were unhappy with that choice, which other data
structure would you have built the model with?
3 Random Forests (25 Points)
Explanation: Random forests are the more practical application of decision trees in the real world. Random
forests are an ensemble learning method that uses a voting system to determine the most appropriate prediction.
They generate a number of decision trees, each individually trained, before passing the prediction sample
to all the trees. Whichever prediction value receives the most ‘votes’ is selected as the final prediction. For a more detailed explanation of what random forests are, see https://www.ibm.com/cloud/learn/randomforest. For more reading on example applications of random forests, see
https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html. Note: you are
still expected to implement the functions from scratch.
File name: add the functions to decision_trees.py
Implement a function in Python:
RF_build_random_forest(X,Y,max_depth,num_of_trees)
that takes training data (X) and training labels (Y). max_depth defines how many layers deep each individual
tree is allowed to grow, while num_of_trees determines the total number of trees that
are built within the forest. You are allowed to test with a different number of trees, but the project will be
graded with the assumption the model is being built with 11 trees. Each tree’s training data will be built from
a random sampling of 10% of the total data found in the provided file (approximately 30 samples). Each tree
should output its individual accuracy. If done correctly, there should be decent variance in the accuracy of the
trees. The function should return a data structure (array, list, dict) of all the individually trained trees. The
data for the random forest function will come from haberman.csv.
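The build loop can be sketched as below. The train_fn parameter stands in for your DT_train_binary so the sketch is self-contained, and the sampling here is without replacement; swap in replace=True if you prefer bootstrap sampling (the spec only says "random sampling").

```python
import numpy as np

def RF_build_random_forest(X, Y, max_depth, num_of_trees, train_fn,
                           sample_fraction=0.1, seed=None):
    # Each tree trains on a random 10% subset of the data.
    rng = np.random.default_rng(seed)
    n_sample = max(1, int(sample_fraction * len(Y)))
    forest = []
    for _ in range(num_of_trees):
        idx = rng.choice(len(Y), size=n_sample, replace=False)
        forest.append(train_fn(X[idx], Y[idx], max_depth))
    return forest
```

Printing each tree's accuracy on the data right after it is trained is where the required per-tree accuracy output would fit inside this loop.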
Write-up: Why might the individual trees have such variance in their accuracy? How would you reduce
this variance and potentially improve accuracy?
Implement a function in Python:
RF_test_random_forest(X,Y,RF)
that takes test data (X) and test labels (Y). RF is the fully generated and trained random forest. To test
the forest, all eleven trees should run the prediction function on each individual sample. Each tree outputs a
prediction, and the prediction with the majority of votes is the final prediction for the forest. This should be done for
all samples to determine the accuracy.
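The voting loop might be sketched as follows, where predict_fn stands in for your DT_make_prediction so the example is self-contained:

```python
import numpy as np

def RF_test_random_forest(X, Y, RF, predict_fn):
    correct = 0
    for x, y in zip(X, Y):
        votes = [predict_fn(x, tree) for tree in RF]
        # With binary labels, an odd number of trees guarantees a
        # strict majority -- no tie to break.
        majority = max(set(votes), key=votes.count)
        correct += int(majority == y)
    return correct / len(Y)
```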
Write-up: Why is it beneficial for the random forest to use an odd number of individual trees?
4 Report (10 Points)
Write-up: Overall, if you are still feeling uncomfortable working with Python, what aspect of the language do you feel you are struggling with the most? If you do feel comfortable, what part of Python do you feel
you should continue practicing?
622 students are required to use LaTeX for their report write-up. 422 students can create a general README.txt but will be
awarded extra credit for using LaTeX. https://www.overleaf.com/ is an excellent tool for learning and creating
LaTeX documents. If using LaTeX for the project report, name the file Project1.pdf.