Description
CptS 475/575 Assignment 5 Part 1: Linear Regression & Logistic Regression Solved
1) (18 points) This question involves the use of multiple linear regression on the red wine
(winequality-red.csv) data set available on Canvas in the Datasets for Assignments module.
This is the same dataset used in Assignment 2.
a. (6 points) Perform a multiple linear regression with pH as the response and all other
variables except citric_acid as the predictors. Show a printout of the result (including
the coefficient, standard error, and t-value for each predictor). Comment on the output by
answering the following questions:
i) Which predictors appear to have a statistically significant relationship to the response?
How do you determine this?
ii) What does the coefficient for the free_sulfur_dioxide variable suggest, in simple terms?
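The quantities part (a) asks for are the ones summary(lm(...)) prints in R. As a hedged sketch of where they come from, the snippet below computes coefficients, standard errors, and t-values by hand with NumPy on synthetic data; the toy predictors merely stand in for the winequality-red.csv columns, which you would load with pandas in practice.

```python
import numpy as np

# Sketch: ordinary least squares with the coefficient, standard error, and
# t-value for each predictor. Synthetic data stands in for the wine CSV.
rng = np.random.default_rng(0)
n, p = 100, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])  # intercept + 3 predictors
beta_true = np.array([3.3, -0.5, 0.2, 0.0])
y = X @ beta_true + rng.normal(scale=0.1, size=n)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y                      # OLS estimates
resid = y - X @ beta_hat
sigma2 = resid @ resid / (n - p - 1)              # residual variance
se = np.sqrt(np.diag(XtX_inv) * sigma2)           # standard errors
t_vals = beta_hat / se                            # t-statistics

for name, b, s, t in zip(["(Intercept)", "x1", "x2", "x3"], beta_hat, se, t_vals):
    print(f"{name:12s} coef={b:8.4f} se={s:.4f} t={t:8.2f}")
```

A predictor's relationship is judged statistically significant when the t-value (equivalently, its p-value) is extreme, which is exactly what question (a)(i) asks you to read off the printout.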
b. (6 points) Produce diagnostic plots of the linear regression fit. Comment on any problems
you see with the fit. Do the residual plots suggest any unusually large outliers? Does the
leverage plot identify any observations with unusually high leverage?
c. (6 points) Fit at least 3 linear regression models (exploring interaction effects) with alcohol
as the response and some combination of other variables as predictors. Do any interactions
appear to be statistically significant?
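For part (c), note that an interaction term is just the elementwise product of two predictors added as an extra column of the design matrix (R's x1:x2, or x1*x2 for main effects plus the interaction). A hedged sketch with made-up variable names (density, sugar) standing in for the wine columns:

```python
import numpy as np

# Sketch: fit alcohol ~ density + sugar + density:sugar by adding the
# product column. Data are synthetic; column names are illustrative only.
rng = np.random.default_rng(1)
n = 200
density = rng.normal(size=n)
sugar = rng.normal(size=n)
alcohol = 10 + 0.5 * density - 0.3 * sugar + 0.8 * density * sugar \
          + rng.normal(scale=0.1, size=n)

X = np.column_stack([np.ones(n), density, sugar, density * sugar])
beta_hat, *_ = np.linalg.lstsq(X, alcohol, rcond=None)
print("interaction coefficient ≈", round(beta_hat[3], 3))  # close to 0.8
```

Whether the interaction is statistically significant is then read from its t-value or p-value, exactly as in part (a).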
2) (30 points) This problem involves the Boston data set, which can be loaded from library MASS
in R and is also made available in the Datasets for Assignments module on Canvas
(boston.csv). We will now try to predict per capita crime rate (crim) using the other variables
in this data set. In other words, per capita crime rate is the response, and the other variables are
the predictors.
a. (6 points) For each predictor, fit a simple linear regression model to predict the response.
Include the code, but not the output for all the models in your solution.
b. (6 points) In which of the models is there a statistically significant association between the
predictor and the response? Considering the meaning of each variable, discuss the
relationship between crim and each of the predictors nox, chas, rm, dis and medv. How do
these relationships differ?
c. (6 points) Fit a multiple regression model to predict the response using all the predictors.
Describe your results. For which predictors can we reject the null hypothesis H0 : βj = 0?
d. (6 points) How do your results from (a) compare to your results from (c)? You can present
this comparison as a plot or as a table or any other form of comparison you deem fit.
e. (6 points) Is there evidence of non-linear association between the predictors age and tax
and the response crim? To answer this question, for each predictor (age and tax), fit a model
of the form:
Y = β0 + β1X + β2X² + β3X³ + ε
Hint: use the poly() function in R. Use the model to assess the extent of non-linear
association.
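As a hedged sketch of the cubic fit, np.polyfit plays the role of R's lm(y ~ poly(x, 3, raw = TRUE)) (note that poly() defaults to orthogonal polynomials, which change the coefficients but not the fitted curve). Synthetic x and y stand in for the age (or tax) column and crim.

```python
import numpy as np

# Sketch: fit Y = b0 + b1*X + b2*X^2 + b3*X^3 + e on synthetic data.
rng = np.random.default_rng(3)
x = rng.uniform(-2, 2, size=300)
y = 1.0 + 0.5 * x - 0.7 * x**2 + 0.2 * x**3 + rng.normal(scale=0.1, size=300)

b3, b2, b1, b0 = np.polyfit(x, y, deg=3)   # highest power first
print(round(b0, 2), round(b1, 2), round(b2, 2), round(b3, 2))
```

Evidence of non-linear association is then the significance of the quadratic and cubic terms.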
3) (12 points) Suppose we collect data for a group of students in a statistics class with variables:
X1 = hours studied,
X2 = undergrad GPA,
X3 = PSQI score (a sleep quality index), and
Y = receive an A.
We fit a logistic regression and produce estimated coefficients β0 = −8, β1 = 0.1, β2 = 1, β3 = −0.04.
a. (4 points) Estimate the probability that a student who studies for 32 hours, has a PSQI score of
11 and has an undergrad GPA of 3.0 gets an A in the class. Show your work.
b. (4 points) How many hours would the student in part (a) need to study to have a 65%
chance of getting an A in the class? Show your work.
c. (4 points) How many hours would a student with a 3.0 GPA and a PSQI score of 3 need to
study to have a 60% chance of getting an A in the class? Show your work.
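The algebra behind all three parts is the logistic function and its inverse: p = 1 / (1 + exp(−(β0 + β1X1 + β2X2 + β3X3))), and solving for X1 at a target p uses the log-odds log(p / (1 − p)) = β0 + β1X1 + β2X2 + β3X3. A sketch you can use to check your hand-worked answers:

```python
import math

# Coefficients from the question.
b0, b1, b2, b3 = -8.0, 0.1, 1.0, -0.04

def prob_a(hours, gpa, psqi):
    """Probability of an A given hours studied, GPA, and PSQI score."""
    z = b0 + b1 * hours + b2 * gpa + b3 * psqi
    return 1.0 / (1.0 + math.exp(-z))

def hours_needed(target_p, gpa, psqi):
    """Hours of study needed to reach a target probability of an A."""
    logit = math.log(target_p / (1.0 - target_p))
    return (logit - b0 - b2 * gpa - b3 * psqi) / b1

print(round(prob_a(32, 3.0, 11), 4))   # part (a)
```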
CptS 475/575 Assignment 5 – Part 2: Classification Solved
1. (20 points) Tokenization
In order to classify the text effectively, you will need to split your text into tokens. It is
common practice when doing this to reduce your words to their stems so that conjugations
produce less noise in your data. For example, the words “speaks”, “speaking”, and
“speakers” are all likely to denote a similar context, and so a stemmed tokenization will
merge all of them into a single stem (“speak”). R has several libraries for tokenization, stemming, and text mining.
Examples of such libraries that you may want to use as a starting point are tokenizers,
SnowballC, and tm, respectively. Alternatively, some of you may want to consider using
quanteda, which will handle these functionalities along with others needed in building your
model in the next step. Similarly, Python has libraries such as sklearn and nltk for
processing text.
You will need to produce a document-term matrix from your stemmed tokenized data. This
will create a large feature set (to be reduced in the following step) where each word
represents a feature, and each article is represented by the number of occurrences of each
word.
Before representing the feature set in a non-compact storage format (such as a plain
matrix), you will want to remove any word which appears in too few documents. For this
assignment, you will remove the 15% of words that appear least frequently across the
documents, i.e., only 85% of the terms should be kept.
To demonstrate your completion of this part, print the feature vector of the words that
appear 4 or more times in the 2205th article in the dataset. Your output should show the
words and the number of occurrences in the article.
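The tokenize → stem → document-term-matrix pipeline can be sketched in miniature without any of the libraries above. The crude suffix-stripping "stemmer" below is only a stand-in for a real one (SnowballC in R, nltk's PorterStemmer in Python), and the two toy documents stand in for your articles.

```python
from collections import Counter

docs = [
    "He spoke quickly while speaking of speakers",
    "She speaks and he speaks too",
]

def crude_stem(word):
    # Illustrative suffix stripper, not a real stemming algorithm.
    for suffix in ("ing", "ers", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

tokenized = [[crude_stem(w) for w in doc.lower().split()] for doc in docs]
vocab = sorted({w for doc in tokenized for w in doc})
dtm = [[Counter(doc)[w] for w in vocab] for doc in tokenized]  # rows = documents

print(vocab)
print(dtm)
```

Each row of dtm is one article's feature vector of term counts, which is exactly the representation the frequency filter and the 2205th-article printout operate on.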
2. (20 points) Classification
For this part of the assignment, you will build and test a Multinomial Naïve Bayes classifier
and a Multinomial Logistic Regression classifier to handle the multiple classes in the
dataset.
First, reduce the feature set using a feature selection method, such as removing highly
correlated features. The caret package in R or similar libraries in Python like sklearn can
be used to help reduce the number of features and improve model performance. You may
wish to try several different feature selection methods to see which produces the best
results.
Next, split your data into a training set and a test set. Your training set should comprise
approximately 80% of your articles and your test set the remaining 20%. In splitting your
data into training and test sets, ensure that the five categories are nearly equally represented
in both sets. Experiment with other split percentages (than 80-20) to ensure a balanced
representation of the five categories, and use the split that gives you the best result for the
required classifiers below.
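A stratified split samples within each category so that all five classes keep roughly the same share in both sets. In practice you would use caret::createDataPartition in R or sklearn's train_test_split(..., stratify=labels) in Python; the sketch below shows the idea on made-up labels.

```python
import random

# Sketch: stratified 80/20 split over five toy categories.
random.seed(42)
labels = ["business", "sport", "tech", "politics", "entertainment"] * 40  # 200 toy articles
indices_by_class = {}
for i, lab in enumerate(labels):
    indices_by_class.setdefault(lab, []).append(i)

train_idx, test_idx = [], []
for lab, idx in indices_by_class.items():
    random.shuffle(idx)                 # shuffle within the class
    cut = int(0.8 * len(idx))           # 80% of this class to training
    train_idx.extend(idx[:cut])
    test_idx.extend(idx[cut:])

print(len(train_idx), len(test_idx))
```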
Next, build a Multinomial Naïve Bayes classifier using your training data. In R, you can
use the multinomial_naive_bayes() function from the naivebayes package, and in Python,
the MultinomialNB class from sklearn can be used. After building the model, use it to
predict the categories of your test data.
Once you have produced a model that generates the best predictions you can get, print a
confusion matrix of the results to demonstrate your completion of this task. For each class,
give scores for precision (TruePositives / (TruePositives + FalsePositives)) and recall
(TruePositives / (TruePositives + FalseNegatives)).
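The per-class scores follow directly from the confusion matrix: for class j, true positives sit on the diagonal, false positives are the rest of column j, and false negatives are the rest of row j. A sketch with a made-up 3-class matrix (your matrix will be 5×5):

```python
# Rows = true class, columns = predicted class; counts are invented.
classes = ["business", "sport", "tech"]
conf = [
    [50, 3, 2],   # true business
    [4, 60, 1],   # true sport
    [2, 2, 46],   # true tech
]

for j, cls in enumerate(classes):
    tp = conf[j][j]
    fp = sum(conf[i][j] for i in range(len(classes)) if i != j)
    fn = sum(conf[j][i] for i in range(len(classes)) if i != j)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    print(f"{cls:10s} precision={precision:.3f} recall={recall:.3f}")
```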
Finally, build a Multinomial Logistic Regression classifier using the same training and test
sets and compare the results using a confusion matrix, as well as precision and recall scores
for each class, with those from the Multinomial Naïve Bayes classifier.
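Conceptually, the multinomial logistic model scores each class with a linear function of the features and turns the scores into class probabilities with a softmax; the predicted class is the argmax. The weights below are invented purely to illustrate the prediction step (in practice they come from fitting, e.g. sklearn's LogisticRegression or R's nnet::multinom).

```python
import numpy as np

def softmax(z):
    z = z - z.max()            # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

W = np.array([[0.2, -0.1], [0.0, 0.3], [-0.4, 0.1]])  # 3 classes x 2 features
b = np.array([0.1, 0.0, -0.1])
x = np.array([1.5, -0.5])                              # one document's features

probs = softmax(W @ x + b)     # class probabilities summing to 1
print(probs, probs.argmax())
```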