## Description

## 1 Density Estimation

(a) (10 points) Suppose we have N i.i.d samples x1, x2, · · · , xn. We will practice the maximum

likelihood estimation techniques to estimate the parameters in each of the following cases:

• We assume that all samples can only take value between 0 and 1, and they are generated

from the Beta distribution with parameter α unknown and β = 1. Please show how to

derive the maximum likelihood estimator of α.

• We assume that all samples are generated from Normal distribution N (θ, θ). Please

show how to derive the maximum likelihood estimator of θ.

(b) (10 points) Suppose random variables X1, X2, · · · , Xn are i.i.d sampled according to density

function f(x) and the kernel density estimation is in the form of ˆf(x) = 1

n

Pn

i=1

1

hK(

x−Xi

h

).

Show the bias of the kernel density estimation method by the following steps:

• Show EX1,··· ,Xn

[

ˆf(x)] = 1

h

R

K(

x−t

h

)f(t)dt.

• Use Taylor’s theorem around x on the density f(x − hz) with z =

x−t

h

.

• Compute the bias E[

ˆf(x)] − f(x).

## 2 Nearest Neighbor

(a) (15 points) Suppose we have the locations (coordinates) of 10 USC students during class

time, and we know their majors, as follows:

Mathematics: {(10, 49),(−12, 38),(−9, 47)}

Electrical Engineering: {(29, 19),(32, 31),(37, 38)}

Computer Science: {(8, 9),(30, −28),(−18, −19),(−21, 12)}

• (5 points) Normalize the data, by following the formula on K-Nearest Neighbor lecture

slide, page 53.

• (10 points) Suppose we have a student whose coordinate is at (9, 18) with unknown

major. Using K-Nearest Neighbor with L2 distance metric, predict the student’s major

if we are using K = 1 (2 points) and if we are using K = 3 (2 points).

Similarly, what

are the student’s major predictions if we use L1 distance metric with K = 1 (2 points)

and with K = 3 (2 points). For K = 3, if there is a tie, please choose the label of the

data point with closer distance. Please compare the results between these 4 different

predictions (2 points). Do not forget to normalize/standardize the coordinate of the

student with unknown major using the mean and standard deviation of the students

with known major. Provide intermediate computations of how do you arrive at your

predictions.

(b) (10 points) Suppose now we want to derive a probabilistic K-Nearest Neighbor for classification of an unlabeled data point x, which is D-dimensional. We have a (multi-dimensional)

D-sphere with center at x, allowing its radius to grow until it precisely contains K labeled

data points, irrespective of their class. At this size, the volume of the sphere is V . Let

there be a total of N labeled data points in the entire space (both inside and outside of the

1

CSCI567 Homework #1 Due 09/23/15 23:59 PDT

sphere), with Nc data points labeled as class c, such that P

c Nc = N. Also, a subset of the

K data points inside of the sphere belongs to class c, there are Kc of them in total. We model

estimated density associated with each class as p (x | Y = c) = Kc

NcV

and the class prior as

p (Y = c) = Nc

N

.

1

• (5 points) Using the fact that P

c Kc = K, derive the formula for unconditional density

p (x).

• (5 points) Using Bayes rule, derive the formula for the posterior probability of class

membership p (Y = c | x).

## 3 Decision Tree

(a) (10 points) Suppose we did 80 observations resulting in the following tabulated data with

binary features, such as temperature, humidity, and sky condition, and we observe the occurrence of the rainy day, denoted by total rainy days per total observations (#Rainy/#Observations)

for each combination of features. Using this data, we want to grow a decision tree which

maximizes information gain, to predict the future occurrence of rainy days. These are some

details:

• To avoid overfitting, please limit the decision tree’s maximum depth to 3, with the leaf

nodes are binaries (i.e. Rainy or not).

• For internal nodes, if there is a tie in information gain, please select based on the following

order: T emperature >> Humidity >> SkyCondition (for example if you have a tie

between T emperature and Humidity, then you will select T emperature as the predictor

feature).

• Decision at leaf nodes (either Rainy or not) will be based on the dominant observation;

in case of a tie, Rainy will be selected.

Please provide the intermediate computations (4 points), the predictor variable/feature that

you select for the split in each node of the tree based on information gain criteria (4 points),

and draw the resulting decision tree (2 points).

Temperature Humidity Sky Condition #Rainy/#Observations

Hot High Cloudy 9/10

Hot High Clear 5/10

Hot Low Cloudy 6/10

Hot Low Clear 3/10

Cool High Cloudy 7/10

Cool High Clear 2/10

Cool Low Cloudy 3/10

Cool Low Clear 1/10

1This problem is a rephrase of some materials from ”Pattern Recognition and Machine Learning” book by Christopher M. Bishop. However, even without having/reading the book, this problem should be fairly easy to solve, just

by following the description in this problem set.

2

CSCI567 Homework #1 Due 09/23/15 23:59 PDT

(b) (10 points) In training decision trees, the ultimate goal is to minimize the classification

error. However, the classifaction error is not a smooth function; thus, several surrogate loss

functions have been proposed. Two of the most common loss functions are the Gini index

and Cross-entropy, see [MLaPP, Section 16.2.2.2] or [ESL, Section 9.2.3] for the definitions.

Prove that, for any discrete probability distribution p with K classes, the value of the Gini

index is less than or equal to the corresponding value of the cross-entropy. This implies that

the Gini index is a better approximation of the misclassification error.

Definitions: For a K-valued discrete random variable with probability mass function pi

, i =

1, . . . , K the Gini index is defined as: PK

k=1 pk(1 − pk) and the cross-entropy is defined as

−

PK

k=1 pk log pk.

(c) (5 points) Suppose we have continuous attributes X1 and X2, and we have collected a labeled

dataset having these attributes with two class labels Y = c (depicted as red circles) and Y = t

(depicted as green triangles). We want to use decision trees for classification on this dataset.

The test for split at each internal node is using inequalities of the form Xi > s, and each

node is evaluated as either true or f alse. s is called a split point, which is any real number

chosen by the learning algorithm. In this decision tree, each attribute is allowed to be tested

any number of times on any path in the tree.

Consider the four cases depicted below (a), (b),

(c), and (d). Which of these dataset can be classified correctly with a tree depth less than 6?

(hint: draw the decision boundaries on each cases)

X1

X2

X1 X1 X1

X2 X2 X2

(a) (b) (c) (d)

## 4 Naive Bayes

(a) (15 points) Suppose a training set with N examples (X1, Y1),(X2, Y2), · · · ,(XN , YN ) is given,

where Xi = (xi1, · · · , xiD) ∈ R

D is a D-dimensional feature vector, and Yi ∈ {1, 2, · · · , K} is

its corresponding label. Given the following assumptions:

• The label variable Y follows a categorical distribution, with parameter pk = P(Y = k),

for k = 1, 2, · · · , K.

• For each xj of the D features, we have that P(xj |Y = yk) follows a Gaussian distribution

of the form N (µjk, σjk).

• Naive Bayes assumption: for all j

0 6= j, xj and xj

0 are conditionally independent given

Y .

Please provide the maximum likelihood estimation for the parameters of the Naive Bayes

with Gaussian assumption. In other words, you need to provide the estimates for pk, µjk,

and σjk, for j = 1, . . . , D and k ∈ {1, 2, · · · , K}.

3

CSCI567 Homework #1 Due 09/23/15 23:59 PDT

(b) (15 points) As a TA, you want to know if one student understand the knowledge covered

in Machine Learning class or not, given his/her answers of a quiz with D multiple choice

problems.

A binary label variable Y is used to indicate whether the student understand enough machine

learning knowledge(Y = 1) or not(Y = 0), where Y follows a Bernoulli distribution with

parameter π = P(Y = 1). For each problem, a binary feature variable Xj denotes whether

the answer to question j is correct(Xj = 1) or not(Xj = 0), where each feature Xj follows a

Bernoulli distribution given the label P(Xj |Y = yk) = θ

Xj

jk (1 − θjk)

1−Xj

.

Given a D-dimensional binary vector X = {X1, . . . , XD} and a label variable Y , you want to

know whether the student is good not not under Naive Bayes assumption.

Please show that P(Y |X) has the form as follows:

P(Y = 1|X) = 1

1 + exp(−w0 + w>X)

Specifically, you need to find the explicit form of w0 and w in terms of π and θjk, for j =

1, . . . , D and k ∈ {0, 1}.

4

CSCI567 Homework #1 Due 09/23/15 23:59 PDT

### 5 Programming

### 5.1 Density Estimation

In this problem, you will write a MATLAB program to fit a density estimator using 1) Gaussian

kernel, 2) Epanechnikov kernel2

, and 3) histogram. You are NOT allowed to use any related

MATLAB toolbox functions for density estimations.

You can get newest version of Matlab from http://itservices.usc.edu/matlab/.

The file hw1progde.mat contains two sets of data points, x tr and x te, which are i.i.d generated

from the same unknown univariate probability density of the unit interval, i.e., between 0 and 1.

(a) Fit the estimators at a variety of values of the bandwidth h on x tr with at least 5 different h.

You can select the values of bandwidth manually or by some methods such as cross validation.

(b) Randomly divide the test data x te into N = 19 subsets.

Each of the subset contains 500

points. You need to carry out kernel density estimation for each subset. You should use the

same values of h from (a). You are not supposed to get the true distribution of x, but you

can still compute the integrated variance of each of the three estimators, which is defined as

R 1

0

1

N

P

n

ˆfh,n(x) − ˆfh(x)

2

dx, where ˆfh,n(x) is estimators of n-th subset, and ˆfh(x) is the

empirical mean of all these N estimators. You should approximate the integrals by summing

over 50 evenly spaced points in the unit interval. Plot the integrated variance as a function

of h.

(c) Compare the relative performance of the three methods based on your results.

### 5.2 Classification

In this assignment, you will experiment with three classification algorithms on real-world datasets.

You will use MATLAB’s functions to experiment with Decision Tree, but you need to implement

Naive Bayes and K-Nearest Neighbor (K-NN) algorithms by yourself. You are NOT allowed

to use any related MATLAB toolbox functions for Naive Bayes and K-NN (e.g. knnclassify,

knnsearch). Below, we describe the steps that you need to take to accomplish this programming

assignment.

Dataset: We will use the Tic-Tac-Toe Endgame Dataset (for all classification algorithms) and the

Nursery Dataset (ONLY for Naive Bayes) from UCI’s machine learning data repository. The training/validation/test sets are provided along with the assignment in Blackboard as hw1ttt train.data,

hw1ttt valid.data, and hw1ttt test.data for the Tic-Tac-Toe Endgame Dataset, and

hw1nursery train.data, hw1nursery valid.data, and hw1nursery test.data for the Nursery

Dataset.

For description of the datasets, please refer to https://archive.ics.uci.edu/ml/

datasets/Tic-Tac-Toe+Endgame and https://archive.ics.uci.edu/ml/datasets/Nursery.

Please follow the steps below:

(a) Data Pre-processing The first step in every data analysis experiment is to inspect the

datasets and make sure that the data has the appropriate format. You will find that the

features in the provided dataset are categorical. However, some of the algorithms require the

2

https://en.wikipedia.org/wiki/Kernel_(statistics)#Kernel_functions_in_common_use

5

CSCI567 Homework #1 Due 09/23/15 23:59 PDT

features to be real-valued numbers. To convert a categorical feature with K categories to

real-valued number, you can create K new binary features. The ith binary feature indicates

whether the original feature belongs to the ith category or not.

For example, if you have

a feature called ’height’ with possible values ’low’, ’medium’, and ’high’, then you will have

three (3) binary features: if in a single data instance, you have a value of ’low’, then your

binary features will be 1-0-0; if value is ’medium’, then your binary features will be 0-1-0; if

value is ’high’, then your binary features will be 0-0-1.

About the training labels, if there are

M labels in the dataset, then convert each of them into a number between 1,…,M. For example

if there are 3 possible labels: ’positive’, ’neutral’, and ’negative’, then ’positive’ label will be

converted to 1, ’neutral’ label will be converted to 2, and ’negative’ label will be converted to

3.

So, for a training dataset (* train.data) of size N, then train data will be a matrix of

size N × D (with D is the total number of binary features, as a result of this pre-processing)

and train label will be a column vector of dimension N × 1. This also applies to validation

dataset (* valid.data) and testing dataset (* test.data). This pre-processed matrices and

vectors will be an input to part 5.2(b), 5.2(c), and 5.2(d).

Hint: One possible way of doing this pre-processing is by using MATLAB’s Map Containers

(http://www.mathworks.com/help/matlab/map-containers.html).

(b) Implement Naive Bayes Please fill in the function naive bayes in naive bayes.m file (in

Blackboard). The inputs of this function are training data and new data (either validation or

testing data).

The function needs to output the accuracy on both training and new data (either validation or testing). Note that some feature values might exist in the validation/testing

data, but do not exist in the training data. In that case, please set the probability of that

feature value to a small value, for example, 0.1. Since you will use Naive Bayes to operate

on 2 datasets later on, it is strongly recommended that you write an implementation of the

algorithm which is general-purpose, as less dependent on the dataset (size, etc.) as possible.

(c) Implement K-NN Please fill in the function knn classify in knn classify.m file (in Blackboard). The inputs of this function are training data, new data (either validation or testing

data) and K. Use the L2 norm distance in the K-NN algorithm. The function needs to

output the accuracy on both training and new data (either validation or testing). Do not

forget to standardize the data beforehand, similar to Problem 2(a), following the formula on

K-NN lecture slide, page 53.

(d) Performance Comparison Compare the three algorithms (K-NN, Naive Bayes, and Decision Tree) on the provided dataset.

K-NN: Consider K = 1, 3, 5, · · · , 15. For each K, report the training, validation and test

accuracy. When computing the training accuracy of K-NN, we use leave-one-out strategy, i.e. classifying each training point using the remaining training points. Note that we

use this strategy only for K-NN in this assignment. Operate ONLY on the Tic-Tac-Toe

Endgame Dataset.

Decision Tree: Train decision trees using function ClassificationTree.fit or fitctree

in Matlab. Report the training, validation and test accuracy for different split criterions

(Gini index and cross-entropy, using the SplitCriterion attribute), by setting the minimum number of leaf node observations to 1, 2, · · · , 10 (using the MinLeaf attribute). So

6

CSCI567 Homework #1

basically, you need to report the results for 2 × 10 = 20 different cases. When training

decision trees, please turn off pruning using the Prune attribute. Operate ONLY on the

Tic-Tac-Toe Endgame Dataset.

Naive Bayes: Report and compare the training, validation, and test accuracy, both on

the Tic-Tac-Toe Endgame Dataset and the Nursery Dataset. If there is a significant

difference between the two datasets’ classification accuracy, could you guess why?

(e) Decision Boundary In this step, you need to apply K-NN on the hw1boundary.mat dataset

(provided in Blackboard) which is a binary classification dataset with only two features. You

need to run K-NN with K = 1, 5, 15, 25 and examine the decision boundary. A simple way

to visualize the decision boundary, is to draw 10000 data points on a uniform 100 × 100 grid

in the square (x, y) ∈ [0, 1] × [0, 1] and classify them using the K-NN classifier.

Then, plot

the data points with different markers corresponding to different classes. Repeat this process

for all k and discuss the smoothness of the decision boundaries as K increases.

7

CSCI567 Homework #1

### Submission Instruction: You need to provide the followings:

• Provide your answers to problems 1-4, 5.1, 5.2(d), and 5.2(e) in PDF file, named as

CSCI567 hw1 fall15.pdf. You need to submit the homework in both hard copy (at CS

Front Desk with a box labeled as CSCI567-homework by 5pm of the deadline date) and

electronic version as pdf file on Blackboard. If you choose handwriting instead of typing

all the answers, you will get 40% points deducted.

• Submit ALL the code and report via Blackboard. The only acceptable language is MATLAB. For your program, you MUST include the main function called CSCI567 hw1 fall15.m

in the root of your folder. After running this main file, your program should be able

to generate all of the results needed for this programming assignment, either as plots

or console outputs.

You can have multiple files (i.e your sub-functions), however, the

only requirement is that once we unzip your folder and execute your main file, your

program should execute correctly. Please double-check your program before submitting.

You should only submit one .zip file. No other formats are allowed except .zip file.

Also, please name it as [lastname] [firstname] hw1 fall15.zip.

Collaboration: You may collaborate. However, collaboration has to be limited to discussion only

and you need to write your own solution and submit separately. You also need to list with

whom you have discussed.

8