Stat 01:960:390 Final Project


Problem
1. Download the Data Set
The Boston housing data was collected in 1978; each of its 506 entries aggregates 14 features for homes in suburbs of Boston, Massachusetts. Find it here:
https://archive.ics.uci.edu/ml/machine-learning-databases/housing/
Note that there are two files: 'housing.data' is the data file and 'housing.names' is a reference file.
2. Import and Preprocess (10 pts)
(a) Importing Data and Naming Variables
When you import the data into SAS, read through section 7, Attribute Information, in the file 'housing.names', and name the fourteen variables accordingly. The last column should be named MEDV, which is the dependent variable in our linear regression analysis. All the remaining right-hand-side variables should be named after items 1-13 of section 7, Attribute Information, respectively.
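The naming step can be sketched as follows. This is a Python/pandas illustration of the logic, not the SAS code the assignment asks for; the fourteen names are the ones listed under section 7 of 'housing.names', and the path `housing.data` is assumed to point at the downloaded file.

```python
import pandas as pd

# The 14 attribute names from section 7 of housing.names; MEDV is the target.
COLS = ["CRIM", "ZN", "INDUS", "CHAS", "NOX", "RM", "AGE",
        "DIS", "RAD", "TAX", "PTRATIO", "B", "LSTAT", "MEDV"]

def load_housing(path="housing.data"):
    # housing.data is whitespace-delimited and has no header row
    return pd.read_csv(path, sep=r"\s+", header=None, names=COLS)
```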
(b) Removing Outliers
Based on prior experience, the dataset contains two types of outliers to remove: any observation with MEDV = 50 and any observation with RM = 8.780. To be specific, remove a row from the data whenever its MEDV = 50 or its RM = 8.780. Report the data size after outlier removal, and explain why outliers can be harmful to our modeling.
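As a sketch of the removal step (in Python/pandas rather than the SAS the assignment expects; the function name is my own), dropping the flagged rows looks like this:

```python
import pandas as pd

def remove_outliers(df):
    # Drop every row where MEDV == 50 (the dataset's censoring cap)
    # or RM == 8.780; the caller can then report len(result).
    mask = (df["MEDV"] == 50) | (df["RM"] == 8.780)
    return df.loc[~mask].reset_index(drop=True)
```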
3. Two-Way ANOVA (10 pts)
(a) Does MEDV depend on CHAS, RAD, and their interaction?
(b) State the model assumptions and check their validity.
4. Linear Regression (20 pts)
(a) Splitting Data into Training and Test Data (5 pts)
• After outlier removal, make the first 70% the training data and the rest the test data.
• To be specific, let N be the total number of rows after removal. First find an integer j* such that j*/N ≈ 70%. Then your training data consists of all rows from the first through the j*-th, and your test data consists of the remaining rows, i.e., the (j*+1)-th through the last.
• Report the sizes of your training data and test data.
(b) Full Linear Model on Training Data (5 pts)
• Estimate the full linear model using the training data.
• Report the F-value, Adj R-square, and Root MSE.
• Do model checking on the residuals and on multicollinearity.
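One standard multicollinearity check is the variance inflation factor, VIF_j = 1/(1 - R²_j), where R²_j comes from regressing the j-th predictor on the others. In SAS this is typically requested via the VIF option on PROC REG's MODEL statement; as a language-neutral illustration of what that number is, here is a minimal NumPy sketch (function name my own):

```python
import numpy as np

def vif(X):
    # Variance inflation factors for the columns of design matrix X (n x p):
    # regress each column on the remaining columns (plus an intercept)
    # and return 1 / (1 - R^2) for each.
    X = np.asarray(X, dtype=float)
    factors = []
    for j in range(X.shape[1]):
        y = X[:, j]
        Z = np.delete(X, j, axis=1)
        Z = np.column_stack([np.ones(len(y)), Z])
        beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
        resid = y - Z @ beta
        tss = np.sum((y - y.mean()) ** 2)
        r2 = 1.0 - np.sum(resid ** 2) / tss
        factors.append(1.0 / (1.0 - r2))
    return factors
```

A VIF near 1 means the predictor is nearly uncorrelated with the others; large values (a common rule of thumb is above 10) signal multicollinearity.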
(c) Best Model on Training Data (10 pts)
• Perform model selection on the linear model using each of these four options: selection = forward, backward, stepwise, and rsquare.
• For each option, report the best model's variable list (which right-hand-side variables it includes), F-value, Adj R-square, and Root MSE, and compare them with the full model from (b). Note that different selection criteria may yield identical models; report all of them and clearly indicate which PROC option each model comes from.
• In terms of minimizing Root MSE, which model is best among all models from the previous step? If the full model is not the best, explain the intuition for why dropping variables may help reduce training error.
(d) Test Error on Test Data
• Report the test Root MSE of your best model from the last step of (c) and compare it with that of the full model from (b). Which Root MSE is larger, and why?
• If you obtained different models in the second step of (c), report their test Root MSEs. Does minimizing Root MSE in (d) yield the same best model as it did in (c)?
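On held-out data, Root MSE is just the square root of the mean squared prediction error; a small Python helper for the comparisons above (note that the Root MSE SAS prints for a fitted model divides by the error degrees of freedom rather than by n):

```python
import numpy as np

def root_mse(y_true, y_pred):
    # Square root of the average squared prediction error.
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))
```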
5. Binary Classification Using Linear Regression (10 pts)
In HW3 we saw an example of binary classification in a non-separable case. Here is another binary classification problem. Start with the dataset after outlier removal.
(a) Find the median of MEDV and create a new variable Y: Y = 1 if MEDV >= its median; Y = 0 otherwise. Discard MEDV and keep Y in your dataset. Use the same splitting rule as in 4(a) to create the training and test data.
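The labeling step in (a) can be sketched as follows (a Python/pandas illustration of the logic; function name my own):

```python
import pandas as pd

def add_binary_label(df):
    # Y = 1 when MEDV is at or above its median, 0 otherwise;
    # MEDV itself is then dropped from the dataset.
    med = df["MEDV"].median()
    out = df.copy()
    out["Y"] = (out["MEDV"] >= med).astype(int)
    return out.drop(columns=["MEDV"])
```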
(b) Out of all right-hand-side variables in the training data, find the two variables X1 and X2 with the highest and second-highest absolute correlation with Y. To compute the correlation coefficients, you may treat any string variable as a binary or categorical variable.
(c) Plot X1 vs. X2 in a two-dimensional scatter plot, labeling each point 'H' if Y = 1 and 'L' if Y = 0. Are these two classes of points separable by a line? Show your plot.
(d) If you run a linear regression of Y on X1 and X2, you may notice that the predicted values are not discrete. To predict Y, you must therefore come up with a reasonable rule for assigning {0, 1}. State your rule explicitly. Under your rule, what is the number of classification errors on the training data, and what is the number on the test data? Report both, and explain intuitively why one error count is larger than the other.
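One reasonable rule (an assumption, not the required answer) is to threshold the fitted value at 0.5, since the two classes sit at 0 and 1. A sketch in Python:

```python
import numpy as np

def classify(pred, cutoff=0.5):
    # Map continuous fitted values to {0, 1} by thresholding at the cutoff.
    return (np.asarray(pred, dtype=float) >= cutoff).astype(int)

def n_errors(y_true, y_hat):
    # Number of misclassified points.
    return int(np.sum(np.asarray(y_true) != np.asarray(y_hat)))
```

Counting errors separately on the training and test sets then gives the two numbers the question asks for; the test count is typically the larger one, since the regression coefficients were tuned to the training rows.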