ECE M146 Homework 5 solved


1. Show that a kernel function $K(x_1, x_2)$ satisfies the following generalization of the
Cauchy-Schwarz inequality:
\[
K(x_1, x_2)^2 \le K(x_1, x_1)\,K(x_2, x_2).
\]
Hint: The Cauchy-Schwarz inequality states that for two vectors $u$ and $v$,
$|u^T v|^2 \le \|u\|^2 \|v\|^2$.
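As a quick numerical sanity check (not part of the required proof), the sketch below evaluates both sides of the inequality for a Gaussian (RBF) kernel on random points; the kernel choice, bandwidth, and variable names are illustrative assumptions of mine.

```python
import numpy as np

def rbf_kernel(x1, x2, gamma=0.5):
    """Gaussian (RBF) kernel; gamma is an illustrative choice."""
    return np.exp(-gamma * np.sum((x1 - x2) ** 2))

rng = np.random.default_rng(0)
for _ in range(5):
    x1, x2 = rng.normal(size=2), rng.normal(size=2)
    lhs = rbf_kernel(x1, x2) ** 2
    rhs = rbf_kernel(x1, x1) * rbf_kernel(x2, x2)
    # The generalized Cauchy-Schwarz inequality predicts lhs <= rhs.
    assert lhs <= rhs + 1e-12
print("K(x1,x2)^2 <= K(x1,x1)K(x2,x2) held on all sampled pairs")
```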
2. Given valid kernels $K_1(x, x')$ and $K_2(x, x')$, show that the following kernels are also
valid:
(a) $K(x, x') = K_1(x, x') + K_2(x, x')$.
(b) $K(x, x') = K_1(x, x')\,K_2(x, x')$.
(c) $K(x, x') = \exp(K_1(x, x'))$. Hint: use your results in (a) and (b).
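A kernel is valid exactly when every Gram matrix it produces is symmetric positive semidefinite, so a numerical spot check (not a proof) is to build two Gram matrices and inspect the smallest eigenvalue of their sum, elementwise product, and exponential. The linear and RBF kernels below, and all names, are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(6, 3))          # 6 illustrative points in R^3

K1 = X @ X.T                          # Gram matrix of a linear kernel
sq = np.sum(X**2, axis=1)
K2 = np.exp(-(sq[:, None] + sq[None, :] - 2 * K1))   # RBF kernel, gamma = 1

def min_eigenvalue(K):
    """Smallest eigenvalue; >= 0 (up to rounding) means positive semidefinite."""
    return np.linalg.eigvalsh(K).min()

print("sum:     ", min_eigenvalue(K1 + K2))     # part (a)
print("product: ", min_eigenvalue(K1 * K2))     # part (b), elementwise (Schur) product
print("exp:     ", min_eigenvalue(np.exp(K1)))  # part (c)
```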
3. In class, we learned that the soft margin SVM has the primal problem:
\[
\begin{aligned}
\min_{\xi, w, b} \quad & \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{m} \xi_i \\
\text{s.t.} \quad & y^{(i)}(w^T x^{(i)} + b) \ge 1 - \xi_i, \quad i = 1, \cdots, m \\
& \xi_i \ge 0, \quad i = 1, \cdots, m,
\end{aligned}
\]
and the dual problem:
\[
\begin{aligned}
\max_{\alpha} \quad & W(\alpha) = \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{m} y^{(i)} y^{(j)} \alpha_i \alpha_j \langle x^{(i)}, x^{(j)} \rangle \\
\text{s.t.} \quad & 0 \le \alpha_i \le C, \quad i = 1, \cdots, m, \\
& \sum_{i=1}^{m} \alpha_i y^{(i)} = 0.
\end{aligned}
\]
Note that $\langle z, s \rangle$ is an alternative expression for the inner product $z^T s$. As usual,
$y^{(i)} \in \{+1, -1\}$.
Now suppose we have solved the dual problem and have the optimal $\alpha$. Show that the
parameter $b$ can be determined using the following equation:
\[
b = \frac{1}{N_{\mathcal{M}}} \sum_{n \in \mathcal{M}} \left( y^{(n)} - \sum_{m \in \mathcal{S}} \alpha_m y^{(m)} \langle x^{(n)}, x^{(m)} \rangle \right). \tag{1}
\]
In (1), $\mathcal{M}$ denotes the set of indices of data points having $0 < \alpha_n < C$, parameter $N_{\mathcal{M}}$
denotes the size of the set $\mathcal{M}$, and $\mathcal{S}$ denotes the set of indices of data points having
$\alpha_n \ne 0$.
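Equation (1) translates directly into code once the optimal $\alpha$ is in hand. A minimal sketch, assuming `alpha`, `y`, `X`, and `C` are given (these names are mine) and using a linear kernel for the inner products:

```python
import numpy as np

def compute_b(alpha, y, X, C, tol=1e-8):
    """Evaluate b via equation (1), assuming `alpha` already solves the dual.

    alpha : (m,) optimal dual variables
    y     : (m,) labels in {+1, -1}
    X     : (m, d) data matrix
    C     : soft-margin penalty used when solving the dual
    """
    K = X @ X.T                                         # <x^(n), x^(m)> for all pairs
    S = np.where(alpha > tol)[0]                        # indices with alpha_n != 0
    M = np.where((alpha > tol) & (alpha < C - tol))[0]  # indices with 0 < alpha_n < C
    # For each n in M:  y^(n) - sum_{m in S} alpha_m y^(m) <x^(n), x^(m)>
    residuals = y[M] - K[np.ix_(M, S)] @ (alpha[S] * y[S])
    return residuals.mean()                             # average over M, i.e. divide by N_M
```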
4. Consider three random variables A, B, and C with joint probabilities P(A, B, C) listed
in the following table.
                 C = 0              C = 1
            B = 0    B = 1     B = 0    B = 1
A = 0       0.096    0.024     0.27     0.03
A = 1       0.224    0.056     0.27     0.03
(a) Calculate P(A|C = 0), P(B|C = 0), and P(A, B|C = 0).
(b) Calculate P(A|C = 1), P(B|C = 1), and P(A, B|C = 1).
(c) Is A conditionally independent of B given C?
(d) Calculate P(A), P(B), and P(A, B).
(e) Is A independent of B?
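The required quantities are sums over the table entries. The sketch below (array layout and variable names are my own) computes the conditional and marginal distributions so the independence questions in (c) and (e) can be checked numerically.

```python
import numpy as np

# P[a, b, c] holds P(A=a, B=b, C=c) from the table above.
P = np.array([[[0.096, 0.27],    # A=0, B=0, C=0/1
               [0.024, 0.03]],   # A=0, B=1, C=0/1
              [[0.224, 0.27],    # A=1, B=0, C=0/1
               [0.056, 0.03]]])  # A=1, B=1, C=0/1

for c in (0, 1):
    P_ABc = P[:, :, c] / P[:, :, c].sum()            # P(A, B | C=c)
    P_Ac, P_Bc = P_ABc.sum(axis=1), P_ABc.sum(axis=0)
    print(f"C={c}: P(A,B|C) == P(A|C)P(B|C)?",
          np.allclose(P_ABc, np.outer(P_Ac, P_Bc)))

P_AB = P.sum(axis=2)                                 # P(A, B), marginalizing out C
P_A, P_B = P_AB.sum(axis=1), P_AB.sum(axis=0)
print("P(A,B) == P(A)P(B)?", np.allclose(P_AB, np.outer(P_A, P_B)))
```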
5. Let us revisit the restaurant selection problem in HW3. You are trying to choose
between two restaurants (sample 9 and sample 10) to eat at. To do this, you will
train a classifier based on your past experiences (samples 1-8). The features of each
restaurant and your judgment on the goodness of samples 1-8 are summarized in the
following chart. In this exercise, instead of a decision tree, you will use the Naïve
Bayes classifier to decide whether restaurants 9 and 10 are good or not.
Sample #   HasOutdoorSeating   HasBar   IsClean   HasGoodAtmosphere   IsGoodRestaurant
   1               0              0        0              1                   1
   2               1              1        0              0                   0
   3               0              1        1              1                   1
   4               1              0        0              1                   1
   5               1              1        1              0                   0
   6               1              0        1              0                   1
   7               1              1        0              1                   1
   8               0              0        1              1                   1
   9               0              1        0              1                   ?
  10               1              1        1              1                   ?
For clarity, we
abbreviate the names of the features and the label as follows: HasOutdoorSeating → O,
HasBar → B, IsClean → C, HasGoodAtmosphere → A, and IsGoodRestaurant → G.
(a) Train the Naïve Bayes classifier by calculating the maximum likelihood estimate
of class priors and class conditional distributions. Namely, calculate the maximum
likelihood estimate of the following: P(G), and P(X|G), X ∈ {O, B, C, A}.
(b) For Sample #9 and #10, make the decision using
\[
\hat{G}_i = \operatorname*{argmax}_{G_i \in \{0,1\}} P(G_i)\, P(O_i, B_i, C_i, A_i \mid G_i),
\]
where $O_i$, $B_i$, $C_i$, and $A_i$ are the feature values for the $i$-th sample.
(c) We use Laplace smoothing to avoid having class conditional probabilities that are
strictly 0. To use Laplace smoothing for a binary classifier, add 1 to the numerator
and add 2 to the denominator when calculating the class conditional distributions.
Let us re-calculate the class conditional distributions with Laplace smoothing.
Namely, calculate the maximum likelihood estimate of P(X|G), X ∈ {O, B, C, A}.
(d) Repeat (b) with the class conditional distributions you get from (c).
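Here is a hedged end-to-end sketch of parts (a)-(d): it counts the class priors, the maximum likelihood class conditionals, and the Laplace-smoothed class conditionals from samples 1-8, then scores samples 9 and 10 under both. Function and variable names are mine, not part of the assignment.

```python
import numpy as np

# Columns: O, B, C, A, G for samples 1-8 (training data from the chart).
train = np.array([
    [0, 0, 0, 1, 1],
    [1, 1, 0, 0, 0],
    [0, 1, 1, 1, 1],
    [1, 0, 0, 1, 1],
    [1, 1, 1, 0, 0],
    [1, 0, 1, 0, 1],
    [1, 1, 0, 1, 1],
    [0, 0, 1, 1, 1],
])
X, G = train[:, :4], train[:, 4]
test = np.array([[0, 1, 0, 1],   # sample 9
                 [1, 1, 1, 1]])  # sample 10

def cond_probs(X, G, smooth=False):
    """P(X_j = 1 | G = g) per feature j and class g; Laplace-smoothed (+1/+2) if requested."""
    a, b = (1, 2) if smooth else (0, 0)
    return {g: (X[G == g].sum(axis=0) + a) / ((G == g).sum() + b) for g in (0, 1)}

def predict(x, prior, theta):
    """Return argmax_g P(g) * prod_j P(x_j | g) under the Naive Bayes assumption."""
    scores = [prior[g] * np.prod(np.where(x == 1, theta[g], 1 - theta[g]))
              for g in (0, 1)]
    return int(np.argmax(scores))

prior = {g: (G == g).mean() for g in (0, 1)}
for smooth in (False, True):
    theta = cond_probs(X, G, smooth)
    print("Laplace" if smooth else "MLE", "->", [predict(x, prior, theta) for x in test])
```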
6. In class, we learned a Naïve Bayes classifier for binary feature values, i.e., $x_j \in \{0, 1\}$,
where we model the class conditional distribution as Bernoulli. In this exercise, you
are going to extend the result to the case where the features are non-binary.
We are given a training set $\{(x^{(i)}, y^{(i)});\ i = 1, \cdots, m\}$, where $x^{(i)} \in \{1, 2, \cdots, s\}^n$
and $y^{(i)} \in \{0, 1\}$. Again, we model the label as a biased coin with $\theta_0 = P(y^{(i)} = 0)$
and $1 - \theta_0 = P(y^{(i)} = 1)$. We model each non-binary feature value $x_j^{(i)}$ (an element of
$x^{(i)}$) as a biased die for each class. This is parameterized by:
\[
\begin{aligned}
P(x_j = k \mid y = 0) &= \theta_{j,k|y=0}, \quad k = 1, \cdots, s-1; \\
P(x_j = s \mid y = 0) &= \theta_{j,s|y=0} = 1 - \sum_{k=1}^{s-1} \theta_{j,k|y=0}; \\
P(x_j = k \mid y = 1) &= \theta_{j,k|y=1}, \quad k = 1, \cdots, s-1; \\
P(x_j = s \mid y = 1) &= \theta_{j,s|y=1} = 1 - \sum_{k=1}^{s-1} \theta_{j,k|y=1}.
\end{aligned}
\]
Notice that we do not model $P(x_j = s \mid y = 0)$ and $P(x_j = s \mid y = 1)$ directly. Instead
we use the above equations to guarantee that all probabilities for each class sum to 1.
(a) Using the Naïve Bayes (NB) assumption, write down the joint probability of
the data,
\[
P(x^{(1)}, \cdots, x^{(m)}, y^{(1)}, \cdots, y^{(m)}),
\]
in terms of the parameters $\theta_0$, $\theta_{j,k|y=0}$, and $\theta_{j,k|y=1}$. You may find the indicator
function $\mathbf{1}(\cdot)$ useful.
(b) Now, maximize the joint probability you get in (a) with respect to each of $\theta_0$,
$\theta_{j,k|y=0}$, and $\theta_{j,k|y=1}$. Write down your resulting $\theta_0$, $\theta_{j,k|y=0}$, and $\theta_{j,k|y=1}$ and show
intermediate steps. Explain in words the meaning of your results.
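To make the parameterization concrete, here is a hedged sketch that estimates $\theta_0$ and each $\theta_{j,k|y}$ by simple within-class counting on a small synthetic dataset. The counting estimator shown is the standard empirical-frequency form for a categorical model; whether it matches your derivation in (b) is for you to verify. The toy data and all names are my own assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
m, n, s = 200, 3, 4                      # samples, features, feature values 1..s (toy sizes)
y = rng.integers(0, 2, size=m)           # labels in {0, 1}
X = rng.integers(1, s + 1, size=(m, n))  # features in {1, ..., s}

# theta_0 = P(y = 0): fraction of training labels equal to 0.
theta_0 = np.mean(y == 0)

# theta[j, k-1, c] estimates P(x_j = k | y = c) as a within-class frequency.
theta = np.zeros((n, s, 2))
for c in (0, 1):
    Xc = X[y == c]
    for j in range(n):
        for k in range(1, s + 1):
            theta[j, k - 1, c] = np.mean(Xc[:, j] == k)

# Each conditional distribution sums to 1 over k, matching the constraint in the problem.
print("theta_0 =", theta_0)
print("sums over k (should all be 1):", theta.sum(axis=1))
```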