CS 5590 Assignment 1: Foundations of Machine Learning
1 Questions
1. k-NN: (8 marks) In k-nearest neighbors (k-NN), classification is achieved by a majority
vote among the nearest neighbors of a data point. Given n points, imagine two classes of
n/2 points each, which overlap to some extent in a 2-dimensional space.
(a) (1 mark) Describe what happens to the training error (using all available data) as
the neighborhood size k varies from n to 1.
(b) (2 marks) Predict, with a sketch, how the generalization error (e.g., holding out some
data for testing) would change as k varies, and explain your reasoning.
(c) (2 marks) Give two reasons (with sound justification) why k-NN may be undesirable
when the input dimension is high.
(d) (3 marks) Is it possible to build a univariate decision tree (with decisions at each
node of the form “is x > a”, “is x < b”, “is y > c”, or “is y < d” for any real constants
a, b, c, d) that classifies exactly like a 1-NN classifier using the Euclidean distance
measure? If so, explain how. If not, explain why not.
2. Bayes Classifier: (6 marks)
(a) (3 marks) A training set consists of one-dimensional examples from two classes.
The training examples from class 1 are {0.5, 0.1, 0.2, 0.4, 0.3, 0.2, 0.2, 0.1, 0.35, 0.25} and
from class 2 are {0.9, 0.8, 0.75, 1.0}. Fit a (one dimensional) Gaussian using Maximum
Likelihood to each of these two classes. You can assume that the variance for class 1 is
0.0149, and the variance for class 2 is 0.0092. Also estimate the class probabilities p1
and p2 using Maximum Likelihood. What is the probability that the test point x = 0.6
belongs to class 1?
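As a self-check on the arithmetic, here is a minimal Python sketch of the required computation, assuming the given variances: the ML estimate of each Gaussian mean is the sample mean, the ML class priors are the empirical class fractions, and the posterior follows from Bayes' rule.

```python
import math

# Training examples from the question.
class1 = [0.5, 0.1, 0.2, 0.4, 0.3, 0.2, 0.2, 0.1, 0.35, 0.25]
class2 = [0.9, 0.8, 0.75, 1.0]
var1, var2 = 0.0149, 0.0092  # variances given in the question

def gaussian_pdf(x, mean, var):
    """Density of N(mean, var) evaluated at x."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

# ML estimates: sample means and empirical class priors.
mu1 = sum(class1) / len(class1)
mu2 = sum(class2) / len(class2)
p1 = len(class1) / (len(class1) + len(class2))
p2 = 1.0 - p1

# Posterior P(class 1 | x) by Bayes' rule at the test point.
x = 0.6
num = gaussian_pdf(x, mu1, var1) * p1
den = num + gaussian_pdf(x, mu2, var2) * p2
print(f"mu1 = {mu1:.3f}, mu2 = {mu2:.4f}, p1 = {p1:.3f}")
print(f"P(class 1 | x = 0.6) = {num / den:.4f}")
```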
(b) (3 marks) You are now going to make a text classifier. To begin with, you attempt to
classify documents as either sport or politics. You decide to represent each document
as a (row) vector of attributes describing the presence or absence of the following words.
x = (goal, football, golf, defence, offence, wicket, office, strategy)
Training data from the sport documents and from the politics documents are represented
below as matrices in which each row gives the 8 attributes of one document.
xpolitics =
    1 0 1 1 1 0 1 1
    0 0 0 1 0 0 1 1
    1 0 0 1 1 0 1 0
    0 1 0 0 1 1 0 1
    0 0 0 1 1 0 1 1
    0 0 0 1 1 0 0 1

xsport =
    1 1 0 0 0 0 0 0
    0 0 1 0 0 0 0 0
    1 1 0 1 0 0 0 0
    1 1 0 1 0 0 0 1
    1 1 0 1 1 0 0 0
    0 0 0 1 0 1 0 0
Using a maximum likelihood naive Bayes classifier, what is the probability that the
document x = (1, 0, 0, 1, 1, 1, 1, 0) is about politics?
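A minimal Python sketch of the maximum likelihood naive Bayes computation (no smoothing, since the question asks for plain ML estimates): the per-class Bernoulli parameters are the column-wise attribute frequencies, and the class-conditional likelihood factorizes over the attributes.

```python
import numpy as np

# Each row is one training document; columns follow the attribute order
# (goal, football, golf, defence, offence, wicket, office, strategy).
politics = np.array([[1, 0, 1, 1, 1, 0, 1, 1],
                     [0, 0, 0, 1, 0, 0, 1, 1],
                     [1, 0, 0, 1, 1, 0, 1, 0],
                     [0, 1, 0, 0, 1, 1, 0, 1],
                     [0, 0, 0, 1, 1, 0, 1, 1],
                     [0, 0, 0, 1, 1, 0, 0, 1]])
sport = np.array([[1, 1, 0, 0, 0, 0, 0, 0],
                  [0, 0, 1, 0, 0, 0, 0, 0],
                  [1, 1, 0, 1, 0, 0, 0, 0],
                  [1, 1, 0, 1, 0, 0, 0, 1],
                  [1, 1, 0, 1, 1, 0, 0, 0],
                  [0, 0, 0, 1, 0, 1, 0, 0]])

# ML estimate of P(word present | class): column-wise mean, no smoothing.
theta_p = politics.mean(axis=0)
theta_s = sport.mean(axis=0)
prior_p = len(politics) / (len(politics) + len(sport))  # 6/12

def nb_likelihood(x, theta):
    # Naive Bayes factorizes P(x | class) over the 8 attributes.
    return np.prod(np.where(x == 1, theta, 1 - theta))

x = np.array([1, 0, 0, 1, 1, 1, 1, 0])
num = nb_likelihood(x, theta_p) * prior_p
den = num + nb_likelihood(x, theta_s) * (1 - prior_p)
print("P(politics | x) =", num / den)
```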
3. Decision Trees: (16 marks) In this question, you will use the Wine Quality dataset
(https://archive.ics.uci.edu/ml/datasets/Wine+Quality), a popular dataset for evaluating
classification algorithms. The classification task is to determine, based on various
parameters, whether a wine's quality score is at least 7. The dataset has already been
preprocessed to convert this into a binary classification problem (scores less than 7
belong to the “zero” class, and scores greater than or equal to 7 belong to the “one”
class). Each line describes a wine using 12 columns: the first 11 describe the wine's
characteristics, and the last column is the ground-truth quality label (0/1). You must not
use the last column as an input feature when you classify the data.
(a) (5 marks) Decision Tree Implementation: Implement your own version of a
decision tree using binary univariate splits, entropy, and information gain.
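For reference, a minimal sketch of the split-selection core under these definitions (recursion, stopping criteria, and prediction are left to your implementation):

```python
import numpy as np

def entropy(y):
    """Shannon entropy of binary integer labels y."""
    p = np.bincount(y, minlength=2) / len(y)
    p = p[p > 0]  # avoid log2(0)
    return -np.sum(p * np.log2(p))

def information_gain(y, y_left, y_right):
    """Entropy reduction from splitting y into y_left and y_right."""
    n = len(y)
    return entropy(y) - (len(y_left) / n) * entropy(y_left) \
                      - (len(y_right) / n) * entropy(y_right)

def best_split(X, y):
    """Exhaustive search over (feature, threshold) pairs for the
    binary univariate split with the highest information gain."""
    best = (None, None, 0.0)  # (feature index, threshold, gain)
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            mask = X[:, j] <= t
            if mask.all() or not mask.any():
                continue  # degenerate split, skip
            g = information_gain(y, y[mask], y[~mask])
            if g > best[2]:
                best = (j, t, g)
    return best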
(b) (5 marks) Cross-Validation: Evaluate your decision tree using 10-fold
cross-validation. Please see the lecture slides for details. In a nutshell, you will first
split the provided data into 10 parts. Then hold out 1 part as the test set and use the
remaining 9 parts for training. Train your decision tree on the training set and use the
trained tree to classify the entries in the test set. Repeat this process for all 10 parts,
so that each part is used as the test set exactly once. To get the final accuracy value,
take the average of the 10 folds' accuracies. With a correct implementation of both parts
(decision tree and cross-validation), your classification accuracy should be around 0.78
or higher.
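A sketch of the fold logic, assuming hypothetical train_fn and predict_fn hooks that wrap your decision tree (fit on the training rows; return predicted labels for the test rows):

```python
import numpy as np

def cross_validate(X, y, train_fn, predict_fn, k=10, seed=0):
    """k-fold cross-validation: each fold is held out once as the
    test set while the remaining folds form the training set."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))          # shuffle before splitting
    folds = np.array_split(idx, k)
    accs = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        model = train_fn(X[train], y[train])
        accs.append(np.mean(predict_fn(model, X[test]) == y[test]))
    return float(np.mean(accs))            # average accuracy over folds
```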
(c) (6 marks) Improvement Strategies: Now, try to improve your decision tree
algorithm. Some things you could do include (this list is not exhaustive):
• Use Gini index instead of entropy
• Use multi-way split (instead of binary split)
• Use multivariate split (instead of univariate)
• Prune the tree after splitting for better generalization
Report your performance as an outcome of ANY TWO of these improvement strategies.
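For instance, the first option amounts to swapping the impurity measure: a minimal sketch of a Gini-based split score that could replace the entropy-based information gain above.

```python
import numpy as np

def gini(y):
    """Gini impurity of binary integer labels y: 1 - sum_c p_c^2."""
    p = np.bincount(y, minlength=2) / len(y)
    return 1.0 - np.sum(p ** 2)

def gini_gain(y, y_left, y_right):
    """Weighted Gini decrease; drop-in replacement for information gain
    when scoring candidate splits."""
    n = len(y)
    return gini(y) - (len(y_left) / n) * gini(y_left) \
                   - (len(y_right) / n) * gini(y_right)
```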
Deliverables:
• Code
• Brief report (PDF) with: (i) Accuracy of your initial implementation; (ii) Accuracy
of your improved implementation, along with your explanation of why the accuracy
improved with this change.