Description
1. (20 points) Linear Regression
In this problem, we will implement least squares linear regression to predict density of wine based on its
acidity. Recall that the error metric for least squares is given by:
J(θ) = 1
2m
Xm
i=1
(y
(i) − hθ(x
(i)
))2
where hθ(x) = θ
T x and all the symbols are as discussed in the class. The files linearX.csv and linearY.csv
contain the acidity of the wine (x
(i)
’s, x
(i) ∈ R) and its density (y
(i)
’s, y
(i) ∈ R), respectively, with
one training example per row. We will implement least squares linear regression to learn the relationship
between x
(i)
’s and y
(i)
’s.
(a) (8 points) Implement batch gradient descent method for optimizing J(θ). Choose an appropriate
learning rate and the stopping criteria (as a function of the change in the value of J(θ)). You can
initialize the parameters as θ = ~0 (the vector of all zeros). Do not forget to include the intercept
term. Report your learning rate, stopping criteria and the final set of parameters obtained by your
algorithm.
1
(b) (3 points) Plot the data on a two-dimensional graph and plot the hypothesis function learned by your
algorithm in the previous part.
(c) (3 points) Draw a 3-dimensional mesh showing the error function (J(θ)) on z-axis and the parameters
in the x − y plane. Display the error value using the current set of parameters at each iteration of
the gradient descent. Include a time gap of 0.2 seconds in your display for each iteration so that the
change in the function value can be observed by the human eye.
(d) (3 points) Repeat the part above for drawing the contours of the error function at each iteration of
the gradient descent. Once again, chose a time gap of 0.2 seconds so that the change be perceived by
the human eye.(Note here plot will be 2-D)
(e) (3 points) Repeat the part above (i.e. draw the contours at each learning iteration) for the step size
values of η = {0.001, 0.025, 0.1}. What do you observe? Comment.
2. (20 points) Sampling and Stochastic Gradient Descent
In this problem, we will introduce the idea of sampling by adding Gaussian noise to the prediction of a
hypothesis and generate synthetic training data. Consider a given hypothesis hθ (i.e. known θ0, θ1, θ2) for
a data point x =
”
x0
x1
x2
#
. Note that x0 = 1 is the intercept term.
y = hθ(x) = θ0 + θ1×1 + θ2×2
Adding Gaussian noise, equation becomes
y = θ0 + θ1×1 + θ2×2 +
where ∼ N (0, σ2
)
To gain deeper understanding behind Stochastic Gradient Descent (SGD), we will use the SGD algorithm
to learn the original hypothesis from the data generated using sampling, for varying batch sizes. We will
implement the version where we make a complete pass through the data in a round robin fashion (after
initially shuffling the examples). If there are r examples in each batch, then there is a total of m
r
batches
assuming m training examples. For the batch number b (1 ≤ b ≤
m
r
), the set of examples is given as:
{x
(i1)
, x(i2)
, · · · , x(ir)} where ik = (b − 1)r + k. The Loss function computed over these r examples is given
as:
Jb(θ) = 1
2k
Xr
k=1
(y
(ik) − hθ(x
(ik)
))2
(a) (4 points) Sample 1 million data points taking values of θ =
”
θ0
θ1
θ2
#
=
”
3
1
2
#
, x1 ∼ N (3, 4) and x2 ∼
N (−1, 4) independently, and noise variance in y, σ
2 = 2.
(b) (6 points) Implement Stochastic gradient descent method for optimizing J(θ). Relearn θ =
”
θ0
θ1
θ2
#
using sampled data points of part a) keeping everything same except the batch size. Keep η = 0.001
and initialize ∀j θj = 0. Report the θ learned each time separately for values of batch size r =
{1, 100, 10000, 1000000}. Carefully decide your convergence criteria in each case. Make sure to watch
the online video posted on the course website for deciding the convergence of SGD algorithm.
(c) (6 points) Do different algorithms in the part above (for varying values of r) converge to the same
parameter values? How much different are these from the parameters of the original hypothesis from
which the data was generated? Comment on the relative speed of convergence and also on number
of iterations in each case. Next, for each of learned models above, report the error on a new test
data of 10,000 samples provided in the file named q2test.csv. Note that this test set was generated
using the same sampling procedure as described in part (a) above. Also, compute the test error with
respect to the prediction of the original hypothesis, and compare with the error obtained using learned
hypothesis in each case. Comment.
(d) (4 points) In the 3 dimensional parameter space(θj on each axis), plot the movement of θ as the
parameters are updated (until convergence) for varying batch sizes. How does the (shape of) movement
compare in each case? Does it make intuitive sense? Argue.
2
3. (15 points) Logistic Regression
Consider the log-likelihood function for logistic regression:
L(θ) = Xm
i=1
y
(i)
log hθ(x
(i)
) + (1 − y
(i)
) log(1 − hθ(x
(i)
))
For the following, you will need to calculate the value of the Hessian H of the above function.
(a) (10 points) The files logisticX.csv and logisticY.csv contain the inputs (x
(i) ∈ R2
) and outputs
(y
(i) ∈ {0, 1}) respectively for a binary classification problem, with one training example per row.
Implement1 Newton’s method for optimizing L(θ), and apply it to fit a logistic regression model to
the data. Initialize Newton’s method with θ = ~0 (the vector of all zeros). What are the coefficients θ
resulting from your fit? (Remember to include the intercept term.)
(b) (5 points) Plot the training data (your axes should be x1 and x2 , corresponding to the two coordinates
of the inputs, and you should use a different symbol for each point plotted to indicate whether that
example had label 1 or 0). Also plot on the same figure the decision boundary fit by logistic regression.
(i.e., this should be a straight line showing the boundary separating the region where h(x) > 0.5 from
where h(x) ≤ 0.5.)
4. (25 points) Gaussian Discrmimant Analysis
In this problem, we will implement GDA for separating out salmons from Alaska and Canada. Each salmon
is represented by two attributes x1 and x2 depicting growth ring diameters in 1) fresh water, 2) marine
water, respectively. File q4x.dat stores the two attribute values with one entry on each row. File q4y.dat
contains the target values (y
(i)
’s ∈ {Alaska, Canada}) on respective rows.
(a) (6 points) Implement Gaussian Discriminant Analysis using the closed form equations described in
class. Assume that both the classes have the same co-variance matrix i.e. Σ0 = Σ1 = Σ. Report the
values of the means, µ0 and µ1, and the co-variance matrix Σ.
(b) (2 points) Plot the training data corresponding to the two coordinates of the input features, and
you should use a different symbol for each point plotted to indicate whether that example had label
Canada or Alaska.
(c) (3 points) Describe the equation of the boundary separating the two regions in terms of the parameters
µ0, µ1 and Σ. Recall that GDA results in a linear separator when the two classes have identical covariance matrix. Along with the data points plotted in the part above, plot (on the same figure)
decision boundary fit by GDA.
(d) (6 points) In general, GDA allows each of the target classes to have its own covariance matrix. This
results (in general) results in a quadratic boundary separating the two class regions. In this case, the
maximum-likelihood estimate of the co-variance matrix Σ0 can be derived using the equation:
Σ0 =
Pm
i=1
1{y
(i) = 0}(x
(i) − µy(i) )(x
(i) − µy(i) )
T
Pm
i=1
1{y
(i) = 0}
(1)
And similarly, for Σ1. The expressions for the means remain the same as before. Implement GDA for
the above problem in this more general setting. Report the values of the parameter estimates i.e. µ0,
µ1, Σ0, Σ1.
(e) (5 points) Describe the equation for the quadratic boundary separating the two regions in terms of
the parameters µ0, µ1 and Σ0, Σ1. On the graph plotted earlier displaying the data points and the
linear separating boundary, also plot the quadratic boundary obtained in the previous step.
(f) (3 points) Carefully analyze the linear as well as the quadratic boundaries obtained. Comment on
your observations.
1Write your own version, and do not call a built-in library function.
3