Description
1. [6 points] Prove Bayes' Theorem. Briefly explain why it is useful for machine learning problems, i.e., by converting the posterior probability into the likelihood and the prior probability.
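For reference, the statement to be proved can be written compactly as follows (the symbols θ for parameters and D for data are illustrative names, not from the assignment):

```latex
% Bayes' Theorem: for events A and B with P(B) > 0,
\[
  P(A \mid B) \;=\; \frac{P(B \mid A)\,P(A)}{P(B)} .
\]
% It follows from the definition of conditional probability:
% P(A \mid B)\,P(B) = P(A \cap B) = P(B \mid A)\,P(A).
% In machine-learning notation, with parameters \theta and data D:
\[
  \underbrace{P(\theta \mid D)}_{\text{posterior}}
    = \frac{\underbrace{P(D \mid \theta)}_{\text{likelihood}}\,
            \underbrace{P(\theta)}_{\text{prior}}}{P(D)} .
\]
```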
2. [10 points] In Lecture 3-1, we gave the normal equation (i.e., closed-form solution) for linear regression using MSE as the cost function. Prove that the closed-form solution for Ridge Regression is w = (λI + XᵀX)⁻¹ Xᵀ y, where I is the identity matrix, X = (x⁽¹⁾, x⁽²⁾, …, x⁽ᵐ⁾)ᵀ is the input data matrix, x⁽ⁱ⁾ = (1, x₁, x₂, …, xₙ) is the i-th data sample, and y = (y⁽¹⁾, y⁽²⁾, …, y⁽ᵐ⁾). Assume the hypothesis function h_w(x) = w₀ + w₁x₁ + w₂x₂ + ⋯ + wₙxₙ, and y⁽ⁱ⁾ is the measurement of h_w(x) for the i-th training sample. The cost function of Ridge Regression is E(w) = MSE(w) + (λ/2) Σⱼ₌₁ⁿ wⱼ². [Hint: refer to the proof of the normal equation for linear regression. Note: please use the following rectified definition of MSE in your proof: MSE(w) = (1/2) Σᵢ₌₁ᵐ (wᵀ x⁽ⁱ⁾ − y⁽ⁱ⁾)².]
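As a sanity check before writing the proof, the claimed closed form can be verified numerically: at the minimizer of E(w), the gradient Xᵀ(Xw − y) + λw should vanish. This is a minimal sketch with made-up data; note it follows the stated λI form, which penalizes every component of w including the bias w₀.

```python
import numpy as np

# Sketch: numerically sanity-check the Ridge closed form
#   w* = (lam*I + X^T X)^{-1} X^T y
# against the cost E(w) = (1/2) sum_i (w^T x_i - y_i)^2 + (lam/2) ||w||^2.
# Data and lambda below are arbitrary illustration values.

rng = np.random.default_rng(0)
m, n = 50, 3
X = np.hstack([np.ones((m, 1)), rng.normal(size=(m, n))])  # prepend bias column
y = X @ np.array([1.0, 2.0, -0.5, 0.3]) + 0.1 * rng.normal(size=m)
lam = 0.5

w_star = np.linalg.solve(lam * np.eye(n + 1) + X.T @ X, X.T @ y)

# Gradient of E at w*: X^T (X w - y) + lam * w, which should vanish
# at the minimizer up to floating-point error.
grad = X.T @ (X @ w_star - y) + lam * w_star
print(np.max(np.abs(grad)))
```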
3. [10 points] Recall the multi-class Softmax Regression model on page 16 of Lecture 3-3. Assume we have K different classes. The posterior probability is ŷₖ = σ(s(x))ₖ = exp(sₖ(x)) / Σⱼ₌₁ᴷ exp(sⱼ(x)) for k = 1, 2, …, K, where sₖ(x) = θₖᵀ x, and the input x is an n-dimensional vector.
1) To learn this Softmax Regression model, how many parameters do we need to estimate? What are these parameters?
2) Consider the cross-entropy cost function J(Θ) (see page 16 of Lecture 3-3) of m training samples {(xᵢ, yᵢ)}, i = 1, 2, …, m. Derive the gradient of J(Θ) with respect to θₖ, as shown on page 17 of Lecture 3-3.
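A useful way to check a derived gradient is against finite differences. The sketch below implements the softmax posterior above and the standard cross-entropy gradient ∂J/∂θₖ = (1/m) Σᵢ (ŷₖ⁽ⁱ⁾ − 1{yᵢ = k}) xᵢ; the 1/m averaging is an assumption here, so adjust it if the lecture's J(Θ) sums rather than averages.

```python
import numpy as np

def softmax(S):
    # S: (m, K) score matrix with rows s(x_i); subtract the row max for stability
    Z = np.exp(S - S.max(axis=1, keepdims=True))
    return Z / Z.sum(axis=1, keepdims=True)

def cost(Theta, X, Y):
    # Cross-entropy: Theta is (n, K), X is (m, n), Y is (m, K) one-hot labels
    P = softmax(X @ Theta)
    return -np.mean(np.sum(Y * np.log(P), axis=1))

def grad(Theta, X, Y):
    # Claimed gradient: dJ/dtheta_k = (1/m) sum_i (yhat_k^(i) - 1{y_i=k}) x_i
    return X.T @ (softmax(X @ Theta) - Y) / X.shape[0]

rng = np.random.default_rng(1)
m, n, K = 20, 4, 3
X = rng.normal(size=(m, n))
Y = np.eye(K)[rng.integers(0, K, size=m)]
Theta = rng.normal(size=(n, K))

# Finite-difference check of one entry of the gradient
eps = 1e-6
T1, T2 = Theta.copy(), Theta.copy()
T1[2, 1] += eps
T2[2, 1] -= eps
fd = (cost(T1, X, Y) - cost(T2, X, Y)) / (2 * eps)
print(abs(fd - grad(Theta, X, Y)[2, 1]))  # should be tiny
```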
Programming Problem:
4. [44 points] In this problem, we write a program to find the coefficients for a linear regression model
for the dataset provided (data2.txt). Assume a linear model: y = w0 + w1*x. You need to
1) Plot the data (i.e., x-axis for the 1st column, y-axis for the 2nd column),
and use Python to implement the following methods to find the coefficients:
2) Normal equation, and
3) Gradient Descent using batch AND stochastic modes respectively:
a) Determine an appropriate termination condition (e.g., when cost function is less than a
threshold, and/or after a given number of iterations).
b) Plot the cost function vs. iterations for each mode; compare and discuss the batch and stochastic modes in terms of accuracy and speed of convergence.
c) Choose the best learning rate. For example, you can plot the cost function vs. the learning rate to determine it.
Please implement the algorithms yourself and do NOT use the fit() function of any library.
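The three required solvers can be sketched as follows. This is a minimal, self-contained outline, not a complete solution: it uses synthetic data so it runs on its own, whereas the assignment should load data2.txt (e.g. `data = np.loadtxt("data2.txt")`, assuming a two-column format), and the learning rate, iteration counts, and termination threshold below are example choices to be tuned in parts a)–c). Plotting is omitted.

```python
import numpy as np

# Synthetic stand-in for data2.txt: y = 3 + 2x plus noise.
rng = np.random.default_rng(42)
x = rng.uniform(0, 10, size=100)
y = 3.0 + 2.0 * x + rng.normal(scale=0.5, size=100)

X = np.column_stack([np.ones_like(x), x])  # design matrix with bias column
m = len(y)

def cost(w):
    # MSE cost J(w) = (1/2m) * sum (Xw - y)^2
    return np.sum((X @ w - y) ** 2) / (2 * m)

# 2) Normal equation: w = (X^T X)^{-1} X^T y
w_ne = np.linalg.solve(X.T @ X, X.T @ y)

# 3, batch mode: full-gradient step each iteration; terminate when the
# cost stops improving by more than tol (one possible stopping rule).
w_b = np.zeros(2)
lr, tol = 0.01, 1e-10
prev = np.inf
for it in range(50000):
    w_b -= lr * X.T @ (X @ w_b - y) / m
    c = cost(w_b)
    if abs(prev - c) < tol:
        break
    prev = c

# 3, stochastic mode: one randomly ordered sample per update.
w_s = np.zeros(2)
for epoch in range(200):
    for i in rng.permutation(m):
        w_s -= lr * (X[i] @ w_s - y[i]) * X[i]

print(w_ne, w_b, w_s)  # all three should land near (3.0, 2.0)
```

With a constant learning rate, the stochastic estimate keeps fluctuating around the optimum while the batch estimate converges smoothly; that contrast is exactly what part b) asks you to discuss from the cost-vs-iterations plots.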