EL-GY-9123 Homework 3b: Cross Validation and Feature Selection

1. Suppose you are given a dataset with N samples (each with a feature vector $x_n$ and an observed target value $y_n$). Based on this dataset, you are tasked with designing a multiple linear regression function that can be used to predict the target value $y$ for any new sample with feature vector $x$. Furthermore, you should report the expected prediction error (the mean square error between the predicted value and the true, but unknown, target value over all possible new samples).

The following are several options:

(a) Use all N samples to determine the optimal linear regressor that minimizes the mean square prediction error over these N samples. Then calculate the mean square error between the predicted values and the true values for these same samples.

(b) Divide the N samples into two halves, train your linear regressor on one half (the training set), apply the trained regressor to the samples in the other half (the validation set), and evaluate the mean square error on the validation set.

(c) Run K-fold cross validation to generate K regressors, and determine the mean square error on the validation set in each fold. Finally, report the average of the mean square errors obtained over the K validation sets (a code sketch of this procedure follows this question).

What may be the problem with each approach? Which method would you use? Your answer should consider two cases: when N is very large, and when N is relatively small. Also, with the cross-validation approach, how would you use the K regressors you developed to predict a new sample?
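As an illustration of the procedure in option (c), here is a minimal K-fold cross-validation sketch using scikit-learn. The arrays X and y and the choice K = 5 are made-up assumptions for the example, not part of the assignment.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold
from sklearn.metrics import mean_squared_error

# Assumed example data: N samples with J features (replace with the real dataset).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=200)

K = 5
fold_mse = []
for train_idx, val_idx in KFold(n_splits=K, shuffle=True, random_state=0).split(X):
    reg = LinearRegression().fit(X[train_idx], y[train_idx])  # train on K-1 folds
    pred = reg.predict(X[val_idx])                            # predict the held-out fold
    fold_mse.append(mean_squared_error(y[val_idx], pred))

# Cross-validation estimate of the expected prediction error:
print("average validation MSE:", np.mean(fold_mse))
```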

2. Suppose you used K-fold cross validation to generate K linear regressors, denoted by $\beta^k$, $k = 1, 2, \ldots, K$, with $\beta^k = [\beta^k_0, \beta^k_1, \ldots, \beta^k_J]$. To predict the target value for a new sample with feature vector $x = [x_1, x_2, \ldots, x_J]$, you could apply each predictor $\beta^k$ to $x$ to generate K predictions $y^k$, $k = 1, 2, \ldots, K$, and then average these predictions to obtain your final prediction. Show that this is equivalent to deriving an average predictor $\bar{\beta} = [\bar{\beta}_j,\; j = 0, 1, \ldots, J]$, with $\bar{\beta}_j = \bigl(\sum_{k=1}^{K} \beta^k_j\bigr)/K$, and applying this average predictor to the sample.
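A quick numerical check of this equivalence; the coefficient matrix B and the sample x below are made-up values for illustration.

```python
import numpy as np

# Assumed example: K = 3 regressors over J = 4 features, each row is
# [beta_0, beta_1, ..., beta_J] (intercept first).
B = np.array([[0.5, 1.0, -2.0, 0.3, 0.0],
              [0.4, 1.2, -1.8, 0.1, 0.2],
              [0.6, 0.9, -2.1, 0.2, 0.1]])
x = np.array([1.0, 0.2, -0.5, 2.0])           # feature vector [x_1, ..., x_J]
x_aug = np.concatenate(([1.0], x))            # prepend 1 for the intercept term

avg_of_preds = np.mean(B @ x_aug)             # average the K individual predictions
pred_of_avg = B.mean(axis=0) @ x_aug          # predict with the averaged coefficients
print(np.isclose(avg_of_preds, pred_of_avg))  # True: the two are identical
```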

3. Consider again the development of a linear regressor from training data. Suppose the feature vector contains many features, and you know that only a subset of them are helpful for predicting the target variable, but you do not know how many features should be included or which ones they are. One way to do feature selection is with LASSO regression. Describe how you would go about determining the optimal subset. Consider two cases: when N is very large, and when N is relatively small.
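One common way to carry out such a LASSO-based selection is to sweep the regularization strength with cross-validation and keep the features that end up with nonzero coefficients. The sketch below uses scikit-learn's LassoCV on assumed arrays X and y; the data and the choice of 5 folds are illustrative only.

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

# Assumed example data: only the first 3 of 10 features are informative.
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 10))
y = 2 * X[:, 0] - 1.5 * X[:, 1] + 0.8 * X[:, 2] + 0.1 * rng.normal(size=300)

# Standardize the features (see question 6), then pick the penalty by cross-validation.
X_std = StandardScaler().fit_transform(X)
lasso = LassoCV(cv=5).fit(X_std, y)

selected = np.flatnonzero(lasso.coef_ != 0)   # features with nonzero coefficients
print("chosen alpha:", lasso.alpha_)
print("selected feature indices:", selected)
```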

4. Continue with the previous problem. Instead of using the LASSO method, list some other
methods that you may use for feature selection.

5. Given raw data samples $x^r_{i,j}$, $y^r_i$, we often perform data normalization so that the normalized features are $x_{i,j} = (x^r_{i,j} - \bar{x}_j)/\sigma_j$ and the normalized target is $y_i = (y^r_i - \bar{y})/\sigma_y$. Here $\bar{x}_j$ and $\sigma_j$ denote the mean and standard deviation of feature $j$, and $\bar{y}$ and $\sigma_y$ denote the mean and standard deviation of the target, all computed from the given data samples.

(a) Show that the normalized features and target each have zero mean and unit variance.

(b) Suppose you would like to predict the normalized target $y$ from the normalized features $x_1, x_2, \ldots, x_J$ using a linear regressor $\hat{y} = \beta_0 + \beta_1 x_1 + \ldots + \beta_J x_J$. You will use the normalized training data to derive the regression coefficients $\beta_j$, $j = 0, 1, \ldots, J$. Show that the optimal intercept term should be zero, i.e., $\beta_0 = 0$.

(c) Let the regression coefficients determined for the normalized data be $\beta = [\beta_1, \beta_2, \ldots, \beta_J]$. Describe how you would apply $\beta$ to any given raw test sample with features $x^r = [x^r_1, x^r_2, \ldots, x^r_J]$. What are the equivalent regression coefficients $\beta^r = [\beta^r_0, \beta^r_1, \ldots, \beta^r_J]$ for the raw data?
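To make the normalization defined at the start of question 5 concrete, here is a small numpy sketch that standardizes made-up raw arrays X_raw and y_raw and numerically confirms the zero-mean, unit-variance property from part (a).

```python
import numpy as np

# Assumed example raw data: N = 100 samples, J = 3 features.
rng = np.random.default_rng(2)
X_raw = rng.normal(loc=5.0, scale=3.0, size=(100, 3))
y_raw = rng.normal(loc=-2.0, scale=4.0, size=100)

# Per-feature and target statistics computed from the given samples.
x_bar, sigma_x = X_raw.mean(axis=0), X_raw.std(axis=0)
y_bar, sigma_y = y_raw.mean(), y_raw.std()

# Normalization as defined in the problem statement.
X = (X_raw - x_bar) / sigma_x
y = (y_raw - y_bar) / sigma_y

# Numerical check of part (a): zero mean and unit variance.
print(np.allclose(X.mean(axis=0), 0), np.allclose(X.var(axis=0), 1))
print(np.isclose(y.mean(), 0), np.isclose(y.var(), 1))
```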

6. Why is data normalization important with ridge regression and LASSO regression?

7. What are the differences between ridge regression and LASSO regression? What are their pros and cons?

8. Ridge regression is a linear predictor that minimizes the following loss function:
$$J(\beta) = \|A\beta - y\|^2 + \alpha\|\beta\|^2$$
Show that the optimal solution is
$$\beta_{\mathrm{opt}} = (A^T A + \alpha I)^{-1} A^T y$$
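A minimal numerical check of this closed-form solution; the matrices below are made-up, and scikit-learn's Ridge (with fit_intercept=False, so the objectives match) is used only to cross-check the result.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Assumed example design matrix A and target y.
rng = np.random.default_rng(3)
A = rng.normal(size=(50, 4))
y = A @ np.array([1.0, -2.0, 0.5, 0.0]) + 0.1 * rng.normal(size=50)
alpha = 2.0

# Closed-form ridge solution: (A^T A + alpha I)^{-1} A^T y.
beta_opt = np.linalg.solve(A.T @ A + alpha * np.eye(A.shape[1]), A.T @ y)

# Same objective minimized by scikit-learn (no intercept term).
beta_sklearn = Ridge(alpha=alpha, fit_intercept=False).fit(A, y).coef_
print(np.allclose(beta_opt, beta_sklearn))
```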

9. Instead of using either the ridge or the LASSO loss function, one can develop a linear regressor by minimizing the following loss function (known as the elastic net):
$$J(\beta) = \|y - A\beta\|^2 + \alpha\bigl(\lambda\|\beta\|^2 + (1 - \lambda)\|\beta\|_1\bigr)$$
Show how you can turn this into a LASSO problem using an augmented version of A and y.
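One way to see the augmentation idea numerically: for any $\beta$, appending $\sqrt{\alpha\lambda}\,I$ to $A$ and zeros to $y$ absorbs the quadratic penalty into the squared-error term, leaving only the $\ell_1$ penalty $\alpha(1-\lambda)\|\beta\|_1$, i.e., a LASSO problem on the augmented data. The data and hyperparameters below are made-up for illustration.

```python
import numpy as np

# Assumed example data and elastic-net hyperparameters.
rng = np.random.default_rng(4)
A = rng.normal(size=(30, 5))
y = rng.normal(size=30)
alpha, lam = 1.5, 0.4

# Augmentation: append sqrt(alpha*lambda) * I to A and zeros to y.
A_aug = np.vstack([A, np.sqrt(alpha * lam) * np.eye(A.shape[1])])
y_aug = np.concatenate([y, np.zeros(A.shape[1])])

# Check: the augmented squared error equals the original squared error plus the
# quadratic penalty, for an arbitrary beta.
beta = rng.normal(size=5)
lhs = np.sum((y_aug - A_aug @ beta) ** 2)
rhs = np.sum((y - A @ beta) ** 2) + alpha * lam * np.sum(beta ** 2)
print(np.isclose(lhs, rhs))  # True
```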