ECE M146 Homework 3
1. Consider the modified objective function of regularized logistic regression:
J(w) = −∑_{n=1}^{N} [y_n log h_w(x_n) + (1 − y_n) log(1 − h_w(x_n))] + (1/2) ∑_i w_i^2,   (1)

where h_w(x) = σ(w^T x) and the sigmoid function is σ(x) = 1/(1 + exp(−x)). Find the partial
derivative ∂J/∂w_j and derive the gradient descent update rules for the weights.
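As a companion to this derivation, here is a minimal NumPy sketch of one batch gradient descent step on J(w). It assumes the gradient works out to ∂J/∂w_j = ∑_n (h_w(x_n) − y_n) x_{nj} + w_j; the names X, y, w, and eta below are illustrative, not part of the assignment.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_step(w, X, y, eta=0.1):
    """One gradient descent step on the regularized objective J(w).
    X: (N, d) data matrix, y: (N,) labels in {0, 1}, w: (d,) weights, eta: step size."""
    h = sigmoid(X @ w)            # h_w(x_n) for every sample
    grad = X.T @ (h - y) + w      # sum_n (h_w(x_n) - y_n) x_n, plus w from the L2 term
    return w - eta * grad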
2. In class we have seen the probabilistic interpretation of the logistic regression objective, and the derivation of the gradient descent rule for maximizing the conditional
likelihood. We have this maximum likelihood (ML) formulation for w ∈ R^m:

w* = arg max_w ∏_{i=1}^{n} P(y_i | x_i, w).   (2)
To prevent overfitting, we want the weights to be small. To achieve this, we consider
some prior distribution of the weights f(w) and use maximum a posteriori (MAP)
estimation:
w* = arg max_w ∏_{i=1}^{n} P(y_i | x_i, w) f(w).   (3)
Assuming a standard Gaussian prior N(0, I) for the weight vector, show that the
MAP estimation is equivalent to minimizing the logistic regression objective with L2
regularization as shown in the previous question. Note: for Z ∼ N(0, I_m),

f_Z(z) = (1 / (2π)^{m/2}) exp(−∑_{i=1}^{m} z_i^2 / 2).
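As a quick numerical illustration (not a proof) of the claimed equivalence: with a standard Gaussian prior, the negative log of the MAP objective in (3) differs from the regularized objective in (1) only by a constant that does not depend on w. The toy data and variable names below are purely illustrative.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))            # toy features, m = 3
y = rng.integers(0, 2, size=20)         # toy binary labels
w = rng.normal(size=3)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

h = sigmoid(X @ w)
nll = -np.sum(y * np.log(h) + (1 - y) * np.log(1 - h))   # negative log-likelihood
log_prior = -0.5 * w @ w - 1.5 * np.log(2 * np.pi)       # log of the N(0, I_3) density
reg_obj = nll + 0.5 * w @ w                              # objective (1)

# The difference is the constant (m/2) log(2*pi), independent of w.
print(np.isclose((nll - log_prior) - reg_obj, 1.5 * np.log(2 * np.pi)))   # True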
3. You are trying to choose between two restaurants (samples 9 and 10) to eat
at. To do this, you will use a decision tree trained on your past
experiences (samples 1–8). The features for each restaurant and your judgment on the
goodness of samples 1–8 are summarized in the following chart.
Sample # HasOutdoorSeating HasBar IsClean HasGoodAtmosphere IsGoodRestaurant
1 0 0 0 1 1
2 1 1 0 0 0
3 0 1 1 1 1
4 1 0 0 1 1
5 1 1 1 0 0
6 1 0 1 0 1
7 1 1 0 1 1
8 0 0 1 1 1
9 0 1 0 1 ?
10 1 1 1 1 ?
(a) What is the entropy of IsGoodRestaurant, i.e., H(IsGoodRestaurant)?
(b) Calculate the conditional entropy of IsGoodRestaurant conditioning on HasOutdoorSeating. To do this, first compute H(IsGoodRestaurant|HasOutdoorSeating =
0) and H(IsGoodRestaurant|HasOutdoorSeating = 1), then weigh each term by
the probabilities P(HasOutdoorSeating = 0) and P(HasOutdoorSeating = 1),
respectively. Namely, calculate the following:
H(IsGoodRestaurant|HasOutdoorSeating)
=P(HasOutdoorSeating = 0)H(IsGoodRestaurant|HasOutdoorSeating = 0)
+P(HasOutdoorSeating = 1)H(IsGoodRestaurant|HasOutdoorSeating = 1).
(c) Similarly, calculate
H(IsGoodRestaurant|X), for X ∈ {HasBar, IsClean, HasGoodAtmosphere},
i.e., the conditional entropy of IsGoodRestaurant conditioning on the other three
features.
(d) Calculate the information gain:
I(IsGoodRestaurant; X) = H(IsGoodRestaurant) − H(IsGoodRestaurant|X),
for
X ∈ {HasOutdoorSeating, HasBar, IsClean, HasGoodAtmosphere}.
(e) Based on the information gain, determine the first attribute to split on. (A computational sketch for parts (a)–(e) appears after part (g).)
(f) Make the full decision tree. After each split, treat the sets of samples with
X = 0 and X = 1 as two separate sets and redo (b), (c), (d) and (e) on
each of them. X is the feature used for the previous split and is thus excluded from
the available features that can be split on next. Terminate splitting if, after
the previous split, the entropy of IsGoodRestaurant in the current set is 0. For
example, if we choose HasGoodAtmosphere as our first feature to split on, we get
H(IsGoodRestaurant|HasGoodAtmosphere = 1) = 0. We thus stop splitting the
tree in this branch. Draw the tree and indicate the split at each node.
(g) Now, determine if restaurants 9 and 10 are good or not.
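For parts (a)–(e), here is a minimal sketch that computes the entropy, the conditional entropies, and the information gains directly from the chart. The dictionary below transcribes samples 1–8, and all variable names are illustrative.

import numpy as np

features = {
    "HasOutdoorSeating": [0, 1, 0, 1, 1, 1, 1, 0],
    "HasBar":            [0, 1, 1, 0, 1, 0, 1, 0],
    "IsClean":           [0, 0, 1, 0, 1, 1, 0, 1],
    "HasGoodAtmosphere": [1, 0, 1, 1, 0, 0, 1, 1],
}
label = np.array([1, 0, 1, 1, 0, 1, 1, 1])   # IsGoodRestaurant for samples 1-8

def entropy(y):
    """Binary entropy (in bits) of a 0/1 label vector."""
    p = np.bincount(y, minlength=2) / len(y)
    return -sum(pi * np.log2(pi) for pi in p if pi > 0)

H_label = entropy(label)
print("H(IsGoodRestaurant) =", H_label)                         # part (a)

for name, col in features.items():
    col = np.array(col)
    H_cond = sum((col == v).mean() * entropy(label[col == v])   # parts (b), (c)
                 for v in (0, 1))
    print(f"I(IsGoodRestaurant; {name}) = {H_label - H_cond:.3f}")   # part (d)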
4. When training decision trees for classification, we generally use entropy or the Gini index
to help determine good splits. These functions are examples of impurity measures, which
we can formally define as follows.
Assume we are performing binary classification. Let V be a set of data points and let
{y_i ∈ {+1, −1}, ∀i ∈ V} be the set of labels. Let V1 ⊆ V. We define p and q as
follows:

p(V1, V) = |V1| / |V|,
q(V1) = |{i : i ∈ V1, y_i = 1}| / |V1|,

where | · | denotes the cardinality of a set.
Let i(q(V )) measure the impurity of a set V . Two desired properties of i(q(V )) are:
• i(q(V )) = 0 when the set has only one class
• i(q(V )) reaches its maximum value for sets where the two classes are perfectly
balanced.
When a split is performed on a set V, the result is two sets V1 and V2 such that
V1 ∪ V2 = V and V1 ∩ V2 = ∅. The information gain of this split is defined as

I(V1, V2, V) = i(q(V)) − (p(V1, V) i(q(V1)) + p(V2, V) i(q(V2))).
Using the previous notation, we define the binary Gini index and binary entropy as:
gini(q(V )) = 2q(V )(1 − q(V ))
H(q(V )) = −(q(V ) log(q(V )) + (1 − q(V )) log(1 − q(V )))
(a) Like binary entropy, the Gini index is a commonly used impurity measure. Plot both the Gini index and the binary entropy as functions of q(V) from 0 to 1. Comment on their similarities. (A plotting sketch appears after part (d).)
(b) Show that if i(q(V)) is concave in q(V), then I(V1, V2, V) ≥ 0 for all V1, V2 with
V1 ∪ V2 = V and V1 ∩ V2 = ∅. This means that no split loses information. Recall
that a function i is concave if, for all λ ∈ [0, 1] and all q1, q2, i(λq1 + (1 − λ)q2) ≥
λ i(q1) + (1 − λ) i(q2).
(c) Show that binary entropy is concave. Hint: Show that the 2nd derivative is always
non-positive.
(d) Show that the binary Gini index is concave.
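For part (a), a minimal plotting sketch (the natural logarithm is used for the entropy here; choosing another base only rescales that curve vertically):

import numpy as np
import matplotlib.pyplot as plt

q = np.linspace(1e-6, 1 - 1e-6, 500)        # avoid log(0) at the endpoints
gini = 2 * q * (1 - q)
H = -(q * np.log(q) + (1 - q) * np.log(1 - q))

plt.plot(q, gini, label="Gini index")
plt.plot(q, H, label="binary entropy")
plt.xlabel("q(V)")
plt.ylabel("impurity")
plt.legend()
plt.show()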

5. Consider the following plots:
[Four plots: (1) Decision Tree Example 1, (2) Decision Tree Example 2, (3) Decision Tree Example 3, (4) Decision Tree Example 4.]
Assume that you are using a binary split decision tree to classify the points. At each
node, you may ask questions like: “Is f1 ≥ 3?” or “Is f2 ≤ 1.8?”. Here, f1 and f2 are
the variables on the horizontal axis and vertical axis of the examples, respectively.
(a) Which examples can be fully separated using a depth 2 decision tree?
(b) Which example would have the most complex decision tree in terms of the number
of splits? Explain why.
(c) If you used a depth 4 decision tree, is one more example now separable? If so,
show how you can separate it using a depth 4 decision tree.
6. On April 15, 1912, the largest passenger liner ever made collided with an iceberg during
her maiden voyage. When the Titanic sank, it killed 1502 out of 2224 passengers and
crew. This sensational tragedy shocked the international community and led to better
safety regulations for ships. One of the reasons that the shipwreck resulted in such loss
of life was that there were not enough lifeboats for the passengers and crew. Although
there was some element of luck involved in surviving the sinking, some groups of people
were more likely to survive than others.
In this section, you will use decision trees to predict whether a person survives or
not. You will be using the data set provided in dataTesting X.csv, dataTesting Y.csv,
dataTraining X.csv and dataTraining Y.csv which contains feature values and labels
for both training and testing data. The data is normalized so that all feature values are
between 0 and 1. If you are interested in the actual meaning of each feature, we provide
the original data in titanic.csv, where we use the mapping (Female: 0, Male: 1). (For
Python, feel free to use the following libraries: math, collections, numpy, matplotlib,
and sklearn.)
(a) Before starting, it is useful to have a baseline accuracy to compare against. A
simple baseline to consider is majority voting where the classifier always outputs
the class with the majority in the training set. Provide the accuracy of this
classifier on both the training and the testing set. To find the accuracy, calculate
the fraction of samples in both the training and the testing set that have the same class
as the majority class in the training set.
(b) Now, use fitctree (sklearn.tree.DecisionTreeClassifier for Python users) to
train a decision tree classifier on the training data. Make sure to set the splitting
criterion to “deviance” (“entropy” for Python users) so that it uses cross entropy.
What are the training and testing accuracies of this model?
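Here is a minimal sketch for parts (a) and (b), assuming the CSV files are plain numeric matrices with no header row and that the file names use underscores (e.g. dataTraining_X.csv); adjust the loading if the actual files differ.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

X_train = np.loadtxt("dataTraining_X.csv", delimiter=",")
y_train = np.loadtxt("dataTraining_Y.csv", delimiter=",")
X_test = np.loadtxt("dataTesting_X.csv", delimiter=",")
y_test = np.loadtxt("dataTesting_Y.csv", delimiter=",")

# (a) Majority-vote baseline: always predict the most common training label.
majority = np.bincount(y_train.astype(int)).argmax()
print("baseline train accuracy:", np.mean(y_train == majority))
print("baseline test accuracy: ", np.mean(y_test == majority))

# (b) Decision tree trained with the entropy (cross-entropy) splitting criterion.
clf = DecisionTreeClassifier(criterion="entropy")
clf.fit(X_train, y_train)
print("tree train accuracy:", clf.score(X_train, y_train))
print("tree test accuracy: ", clf.score(X_test, y_test))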