EE 660 Homework 2 (Week 3)

1. CART for regression
(a) For CART applied to regression, prove the relation for $w_{m'}^*$ (the optimal prediction value for region $R_{m'}$) given below:
$$w_{m'}^* = \frac{1}{N_{R_{m'}}} \sum_{x_i \in R_{m'}} y_i$$
[Hint: take the derivative of the cost function in that region, and set it equal to 0.]
(b) Also in a regression problem, for a given region $R_{m'}$ containing $N_{R_{m'}}$ data points, and a given feature $x_j$ to threshold, suppose you want to find an optimal threshold value $t$ by trying $k$ different values; how many values $k$ for $t$ need to be tried? (Give an upper bound.) Justify your answer.

2. Variance in a random forest
Let $v_i$, $i = 1, 2, \ldots, B$ be identically distributed random variables, each with mean $\mu_v$ and variance $\sigma_v^2$. Let the average of these random variables be $s = \frac{1}{B} \sum_{i=1}^{B} v_i$.
(a) You are given that the $v_i$ are not independent, and have positive pairwise correlation coefficient $\rho$, given by
$$\rho \triangleq \frac{E\big[(v_i - \mu_v)(v_j - \mu_v)\big]}{\sigma_v^2}, \qquad \rho \ge 0, \quad i \ne j.$$
Prove that the variance of $s$ is:
$$\sigma_s^2 = \rho \sigma_v^2 + \frac{(1 - \rho)}{B} \sigma_v^2$$
(b) Show that $\rho \le 1$ always.
(c) For a given $B$ and $\sigma_v^2$, what is the smallest variance $s$ could have? What is the largest variance $s$ could have? What does this imply about desirable properties of decision trees in a random forest?
3. Random Forest for yeast data classification
This problem is intended to give some hands-on experience using random forests. You are given a dataset adapted from the Yeast Data Set on the UCI repository:
https://archive.ics.uci.edu/ml/datasets/Yeast
It contains 1484 data points with 8 features, and has been partitioned into a training set of 1000 data points and a test set of 484 data points. Be sure to use the datasets posted on the D2L website, as partitioned into training and testing datasets.
The goal is to estimate the Protein Localization Site of each instance. The Protein Localization Site is a categorical label that takes 10 different values, in the form of strings. The provided .csv files are in the usual format for Python use.
(a) Before setting up the random forest classifier, try a couple of baseline (trivial) classifiers for comparison, as follows. Note that both classifiers output a label prediction for data point $i$ without looking at its input feature values $x_i$.
(i) A classifier that randomly picks an output label, with probability given by the label's frequency of occurrence in the training data set.
For example, let $\pi_c = N_c^{(Tr)} / N^{(Tr)}$. Then the classifier chooses label $c$ with probability $\pi_c$, for $c = 1, 2, \ldots, 10$.
Run this classifier 10 times, with a different seed each time, and give the resulting mean percent classification error on the training set and separately on the test set; also give the standard deviation of the percent classification error on the training set and on the test set.
(ii) A classifier that always outputs the label of the most populated class (the class with the largest $\pi_c$ as defined above). Give its percent classification error on the training set and separately on the test set.
The above classifiers provide examples of systems that haven't learned anything from the data features $x$. They can be used as a basis for comparison.
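The two baselines in part (a) can be sketched as follows. This is only one reasonable implementation, run on made-up labels; the actual homework uses the 10 yeast localization-site strings from the D2L CSV files, and the class names and prior probabilities below are invented for illustration:

```python
import numpy as np

# Stand-in training labels (the real data has 10 yeast site strings).
rng = np.random.default_rng(0)
y_train = rng.choice(["CYT", "NUC", "MIT"], size=1000, p=[0.5, 0.3, 0.2])

labels, counts = np.unique(y_train, return_counts=True)
pi = counts / counts.sum()                   # class priors pi_c

# (i) sample each prediction from the prior distribution pi_c
y_pred_random = rng.choice(labels, size=len(y_train), p=pi)
err_random = np.mean(y_pred_random != y_train)

# (ii) always predict the most populated class
y_pred_mode = np.full(len(y_train), labels[np.argmax(pi)])
err_mode = np.mean(y_pred_mode != y_train)
print(round(float(err_random), 3), round(float(err_mode), 3))
```

For the homework itself, repeat (i) with 10 different seeds and report the mean and standard deviation of the percent error on the training and test sets separately.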
(b) Then try the following experimental setup for random forest:
At each iteration, first create a smaller training set, a "bag", randomly drawn from the given training set. The bag size means the fraction of training samples that are used to grow a tree (for example, bag_size=1/3 means sampling 1/3 of the data from the training set). Use bag_size=1/2. You can use train_test_split() or any other technique for this purpose.
Then train a random forest with the "bag" set and the following parameters:
n_estimators = 1 to B, step size 1, B=30
criterion = 'entropy'
max_depth = 5
bootstrap = True
max_features = 3
For each value of the number of trees (n_estimators), repeat the experiment 10 times
(selecting different bag samples each time), and calculate the following results:
• Mean error rate on testing set
• Mean error rate on training set
• Standard deviation of error rate on testing set
• Standard deviation of error rate on training set
Each of the 4 sets of results should be a 30 by 1 vector.
For this problem, please:
(i) Run the above setting, and plot your 4 sets of results against the number of
trees on the x-axis (one plot for the mean error rates and another plot for the
standard deviations, 2 plots in total); report the minimum test error rate and its
corresponding standard deviation. (For example, with B=30 your 'mean error
rate on testing set' is a 30 by 1 vector err, and the corresponding std is also a
30 by 1 vector std; if you find that the best test error rate is at index 28, then
report err[28] and std[28].)
Also answer: how do the error rates change as a function of the number of trees?
(ii) Now set B=100 and try again. Plot the figures as mentioned in (b)(i). Does the
curve show convergence?
(iii) Try bag_size = [1/4, 1/3, 1/2], max_depth = [1, 5, None] (None means
unlimited or unspecified), and max_features = [1, 3, 8]. When trying different
values for a specific parameter, you can keep the other parameters fixed at the
values in (i). For each setting, draw a set of plots as in (i). So you'll have 9
sets of plots.
Answer: how does the performance change as you adjust those parameters?
Notes:
(1) if nothing changes significantly when parameter values are changed, that
could be a positive result because it means the classifier is robust (stable) to
changes in the parameter values;
(2) use the standard deviation as a guide to what changes in error rate are
statistically significant (for example, setting P has minimum error rate 0.41
and std 0.02, setting Q has 0.39 and std 0.02, then Q is not necessarily better
than P since the difference is within one std).
(iv) Based on your results in (i) and (iii), choose a best setting. Compare it with the
two trivial classifiers of part (a): has it learned from the data?
Hints:
Functions you can use:
sklearn.ensemble.RandomForestClassifier
and its method fit(). For more info, see the example in
https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
You may also find the following functions useful (feel free to use them):
sklearn.metrics.accuracy_score (to measure accuracy on training and test data)
matplotlib.pyplot (to show the results)
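The experiment loop in part (b) can be sketched as follows. Synthetic data stands in for the yeast CSVs, and B and the repeat count are shrunk for speed; the function names follow the hints above, but the exact loop structure is an assumption about one reasonable way to organize the runs:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Stand-in for the yeast data (8 features, multi-class labels).
X, y = make_classification(n_samples=300, n_features=8, n_informative=5,
                           n_classes=3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

B, n_repeats, bag_size = 10, 3, 0.5       # homework uses B=30 (then 100), 10 repeats
mean_test_err = np.zeros(B)
for n_trees in range(1, B + 1):
    errs = []
    for rep in range(n_repeats):
        # draw a "bag": a bag_size fraction of the training set
        X_bag, _, y_bag, _ = train_test_split(
            X_tr, y_tr, train_size=bag_size, random_state=rep)
        rf = RandomForestClassifier(n_estimators=n_trees, criterion="entropy",
                                    max_depth=5, bootstrap=True,
                                    max_features=3, random_state=rep)
        rf.fit(X_bag, y_bag)
        errs.append(1 - accuracy_score(y_te, rf.predict(X_te)))
    mean_test_err[n_trees - 1] = np.mean(errs)
print(mean_test_err.shape)                # (10,)
```

Training-set errors and the standard deviations follow the same pattern (np.std over the repeats), giving the four B-by-1 vectors to plot with matplotlib.pyplot.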
4. Forward stagewise additive modeling
(a) Show that forward stagewise additive modeling (Eq. (16.34)) is an adaptive
basis-function model (ABM), by applying Eq. (16.34) iteratively, and putting it
in the form of an additive basis function model (Eq. (16.3)); also give the
resulting expressions for $w_0$, $w_m$, and $\phi_m(x)$ in terms of the quantities in Eq.
(16.34).
(b) If shrinkage is used in the stagewise additive modeling (Eq. (16.35) instead of
Eq. (16.34)), show that this is also an additive basis function model, and give
expressions for $w_0$, $w_m$, and $\phi_m(x)$ in terms of the quantities in Eq. (16.35).
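As a concrete illustration of the idea behind problem 4 (not the requested derivation), here is a minimal forward stagewise loop under squared-error loss, fitting one-sided stump basis functions $\phi(x) = 1\{x > t\}$ to the current residuals; the toy target, threshold grid, and number of stages are all made up:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0.0, 1.0, 200))
y = np.sin(2 * np.pi * x)                 # toy 1-D regression target

f = np.zeros_like(y)                      # running model f_m(x); f_0 = 0
for m in range(20):                       # 20 stages
    r = y - f                             # residuals under squared-error loss
    best = None
    for t in x[::10]:                     # candidate stump thresholds
        mask = x > t
        if mask.any() and (~mask).any():
            w = r[mask].mean()            # optimal weight for phi(x) = 1{x > t}
            sse = ((r - w * mask) ** 2).sum()
            if best is None or sse < best[0]:
                best = (sse, t, w)
    _, t, w = best
    f = f + w * (x > t)                   # f_m = f_{m-1} + w_m * phi_m(x)

mse = np.mean((y - f) ** 2)
print(mse < np.mean(y ** 2))              # True: error shrinks as stages are added
```

Unrolling the update `f = f + w * (x > t)` over all stages is exactly the step that puts the stagewise model into additive basis-function form.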
5. Boosting. For the AdaBoost.M1 algorithm:
(a) Plot $\alpha_m$ vs. $err_m$, for $0 < err_m < 1$.
(b) Let the weight update for data point $i$ at iteration $m$ be represented by
$$w_{i,m+1} = g_m w_{i,m}$$
in which $g$ denotes a gain factor. Plot $g_m$ vs. $err_m$ for $0 < err_m < 1$, for data
points misclassified by $\phi_m$. Also plot (or state the value of, if constant) $g_m$ for
data points correctly classified by $\phi_m$.
(c) If we require each base classifier $\phi_m$ to have a weighted error $err_m < 0.5$, then
what is the resulting range of $\alpha_m$? Of $\beta_m$? Of $g_m$ for misclassified points?
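For orientation on 5(a): in the usual statement of AdaBoost.M1, the classifier weight is $\alpha_m = \log\big((1 - err_m)/err_m\big)$ (the course's own definition should be checked against the lecture notes). A minimal sketch that evaluates it on a grid of $err_m$ values; swapping the final print for matplotlib.pyplot.plot(err, alpha) produces the requested figure:

```python
import numpy as np

# alpha_m = log((1 - err_m) / err_m) on a grid of err_m in (0, 1).
err = np.linspace(0.01, 0.99, 99)
alpha = np.log((1 - err) / err)

# alpha is strictly decreasing in err_m: positive for err_m < 0.5,
# zero at err_m = 0.5, and negative for err_m > 0.5.
print(alpha[0] > 0, alpha[-1] < 0, bool(np.all(np.diff(alpha) < 0)))  # True True True
```

The same grid can be reused for 5(b), since the gain factor for misclassified points is a function of $err_m$ as well.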