DSCC46 Problem Set 4 Solved

Introduction to Statistical Machine Learning
Questions
For (most of) the questions below, please use the fake news dataset uploaded on BlackBoard
(called ‘corona_fake.csv’). You can find the file under ‘Data’ tab.
Please also include your code in your .pdf file (in code blocks).
Data Pre-Processing (40 points)
1) [20 points] Using the pandas package for Python, import the corona_fake.csv dataset,
and do the following:
a) [5 points] Import the nltk package. Check the documentation:
https://www.nltk.org/
b) [15 points] Take a look at the text column in the dataset, and do the following:
i. [3 points] Using nltk.word_tokenize(), tokenize the text.
ii. [3 points] Using the POS-tagging feature (nltk.pos_tag), POS-tag the
tokenized words.
iii. [3 points] Using WordNetLemmatizer (from nltk.stem import
WordNetLemmatizer) lemmatize the pos-tagged words you obtained
above. (Hint: If there is no available tag, append the token as is; else, use the
tag to lemmatize the token)
iv. [3 points] Using the list of stop words that can be imported (from nltk.corpus
import stopwords), remove the stop words from the lemmatized text [Note: the
language needs to be set to ‘english’].
v. [3 points] Finally, remove numbers, words shorter than 2 characters,
punctuation, links, and emojis. Then convert the resulting list of
tokenized, tagged, lemmatized, and cleaned words back into a single string
(joined by a space, ‘ ’) and add the result to your dataset as the text_clean column.
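Two parts of the pre-processing above tend to cause trouble: mapping the Penn Treebank tags returned by nltk.pos_tag onto the part-of-speech letters that WordNetLemmatizer.lemmatize accepts, and the final cleaning pass. The following is a minimal sketch of both using only the standard library. The helper names (penn_to_wordnet, strip_links, clean_tokens) are hypothetical, and the regular expressions are one possible reading of "numbers, punctuation, links and emojis", not the required solution.

```python
import re

# Penn Treebank tag prefix -> WordNet POS letter. These letters are the values
# of nltk.corpus.wordnet.ADJ, VERB, NOUN and ADV.
_PENN_TO_WORDNET = {"J": "a", "V": "v", "N": "n", "R": "r"}

# Assumption: a "link" is anything starting with http://, https:// or www.
_LINK = re.compile(r"https?://\S+|www\.\S+")


def penn_to_wordnet(tag):
    """Return the WordNet POS letter for a Penn tag, or None if there is none."""
    return _PENN_TO_WORDNET.get(tag[:1])


def strip_links(text):
    """Remove links from raw text (easier before tokenizing than after)."""
    return _LINK.sub(" ", text)


def clean_tokens(tokens):
    """Drop digits, punctuation and emojis, then words shorter than 2 characters."""
    kept = []
    for token in tokens:
        # Keeping only a-z removes numbers, punctuation and emojis in one pass.
        word = re.sub(r"[^a-z]", "", token.lower())
        if len(word) >= 2:
            kept.append(word)
    return " ".join(kept)


print(penn_to_wordnet("VBD"))
print(clean_tokens(["Masks", "!", "19", "a", "help", "😷"]))
```

With nltk available, the tag returned by penn_to_wordnet would be passed as the pos argument of WordNetLemmatizer().lemmatize(token, pos=...); when it is None, the token is appended as is, following the hint in part iii.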
2) [20 points] Let’s vectorize the data we produced above by using two approaches: Bag of
Words (BOW) and TF-IDF; and, at the end, we will make a prediction:
a. [5 points] Read the following page: https://en.wikipedia.org/wiki/N-gram. Explain
what an ‘n-gram’ is and why it is helpful in max. 200 words.
b. [5 points] Import CountVectorizer and TfidfVectorizer:
from sklearn.feature_extraction.text import
CountVectorizer,TfidfVectorizer
c. [5 points] Using CountVectorizer, create three vectorized representations of
text_clean [set lowercase=True]:
i. One vectorized representation where ngram_range = (1,1)
ii. One vectorized representation where ngram_range = (1,2)
iii. One vectorized representation where ngram_range = (1,3)
d. [5 points] Using TfidfVectorizer, create three vectorized representations of
text_clean [set lowercase=True]:
i. One vectorized representation where ngram_range = (1,1)
ii. One vectorized representation where ngram_range = (1,2)
iii. One vectorized representation where ngram_range = (1,3)
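To make the ngram_range settings above concrete, here is a small pure-Python sketch of which features a range such as (1,2) selects. The function names are hypothetical, and whitespace splitting is a simplifying assumption: CountVectorizer and TfidfVectorizer use their own tokenizer (by default, tokens of two or more alphanumeric characters), so their vocabularies can differ from this illustration.

```python
def ngrams(tokens, n):
    """Return every contiguous run of n tokens as a space-joined string."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_features(text, ngram_range=(1, 1)):
    """Collect the n-grams for every n in the inclusive range (low, high)."""
    tokens = text.lower().split()
    low, high = ngram_range
    features = []
    for n in range(low, high + 1):
        features.extend(ngrams(tokens, n))
    return features


print(ngram_features("fake news spreads fast", (1, 2)))
# ['fake', 'news', 'spreads', 'fast', 'fake news', 'news spreads', 'spreads fast']
```

Widening the range from (1,1) to (1,3) only adds features, so the vocabulary (and the width of the vectorized matrix) grows with each setting.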
Prediction (20 points)
3) [20 points] Now, let’s use sklearn.linear_model.LogisticRegressionCV
to make some predictions. Set cv = 5, random_state = 265, max_iter = 1000,
and n_jobs = -1 (leave all other parameters at their defaults) [Note: the training
size is 70% and the test size is 30%; split with random_state = 265].
a. [10 points] By using the three (3) different versions of the CountVectorizer
dataset you created above, run logistic regression to predict class labels (fake,
true). Report the three (3) accuracy values, one for each regression.
b. [10 points] By using the three (3) different versions of the TfidfVectorizer
dataset you created above, run logistic regression to predict class labels (fake,
true). Report the three (3) accuracy values, one for each regression.
c. Combine and report all accuracy values in a table (6 values in total).
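The loop over vectorizers and n-gram ranges described above might be organized as in the sketch below. It assumes scikit-learn is installed and uses a tiny made-up corpus (texts, labels) as a stand-in for the text_clean column and the class labels of corona_fake.csv, so the accuracies it prints say nothing about the real dataset. The vectorizer is fitted on the full corpus before splitting, following the order of the questions; fitting it on the training portion only is another option.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegressionCV
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data; replace with df["text_clean"] and the label column.
fake = ["miracle cure hoax virus secret", "secret hoax cure miracle claim",
        "virus hoax secret claim cure", "claim miracle secret virus hoax"]
true = ["wash hands vaccine study health", "health study vaccine trial hands",
        "vaccine trial health study wash", "trial wash hands health vaccine"]
texts = fake * 5 + true * 5
labels = ["fake"] * 20 + ["true"] * 20

results = {}
for name, Vectorizer in [("BOW", CountVectorizer), ("TF-IDF", TfidfVectorizer)]:
    for ngram_range in [(1, 1), (1, 2), (1, 3)]:
        # Vectorize with the requested n-gram range.
        X = Vectorizer(lowercase=True, ngram_range=ngram_range).fit_transform(texts)
        # 70% training, 30% test, with the seed given in the question.
        X_train, X_test, y_train, y_test = train_test_split(
            X, labels, test_size=0.3, random_state=265)
        model = LogisticRegressionCV(
            cv=5, random_state=265, max_iter=1000, n_jobs=-1)
        model.fit(X_train, y_train)
        # score() returns mean accuracy on the held-out test split.
        results[(name, ngram_range)] = model.score(X_test, y_test)

# One row per (vectorizer, ngram_range) pair: 6 values in total.
for (name, ngram_range), accuracy in results.items():
    print(f"{name:7s} ngram_range={ngram_range}  accuracy={accuracy:.3f}")
```

Collecting the scores in a dictionary keyed by (vectorizer, ngram_range) makes it straightforward to turn them into the six-value table, for example with pandas.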
Theoretical question (40 points)
4) [40 points] Check the optimizer (solver) functions used by
sklearn.linear_model.LogisticRegressionCV. For each solver, explain
in around 100 words what it means; specifically:
a. [8 points] What does newton-cg mean?
b. [8 points] What does lbfgs mean?
c. [8 points] What does liblinear mean?
d. [8 points] What does sag mean?
e. [8 points] What does saga mean?
Note: For this question you might need to do some online research. It is your job to find
out how they work. You are also welcome to use formulas / matrices in your description.
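As a possible starting point for the formulas mentioned above, the solvers can be compared by how each one updates the weights on the same objective. The notation below is an assumption made for illustration (labels y_i in {-1, +1}, regularization strength lambda, step size alpha, stored per-sample gradients g_i); scikit-learn parameterizes the penalty through C instead, and liblinear, which is based on coordinate descent, does not fit this gradient-step template and is left out.

```latex
% L2-regularized logistic loss, averaged over n samples, with \sigma(z) = 1/(1+e^{-z})
\[
J(w) = \frac{1}{n}\sum_{i=1}^{n} f_i(w), \qquad
f_i(w) = \log\bigl(1 + e^{-y_i w^\top x_i}\bigr) + \frac{\lambda}{2}\lVert w \rVert^2
\]
\[
\nabla f_i(w) = -\,y_i\,\sigma(-y_i w^\top x_i)\,x_i + \lambda w, \qquad
H(w) = \frac{1}{n}\sum_{i=1}^{n} \sigma(m_i)\bigl(1-\sigma(m_i)\bigr)\,x_i x_i^\top + \lambda I,
\quad m_i = y_i w^\top x_i
\]
% newton-cg: Newton step, with the linear system solved approximately by conjugate gradient
\[
H(w)\,d = -\nabla J(w), \qquad w \leftarrow w + d
\]
% lbfgs: same form of step, but H^{-1} is replaced by an approximation B built
% from a limited history of recent gradient and weight differences
\[
w \leftarrow w - \alpha\, B\, \nabla J(w)
\]
% sag: pick j at random, refresh g_j \leftarrow \nabla f_j(w), step along the average
\[
w \leftarrow w - \frac{\alpha}{n}\sum_{i=1}^{n} g_i
\]
% saga: unbiased variant using the stored gradients from before the refresh of g_j
\[
w \leftarrow w - \alpha\Bigl(\nabla f_j(w) - g_j + \frac{1}{n}\sum_{i=1}^{n} g_i\Bigr)
\]
```

The main contrast is cost per step: newton-cg and lbfgs use the full gradient (and curvature information) at every iteration, while sag and saga touch a single sample per update and reuse stored gradients for the rest.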
