
CS4120/CS6120 Assignment 1 solution


1 Naive Bayes for Text Categorization [10 points]

Given the following short documents, each labeled with a genre (class):
1. murder, cars, map, treasure : Action
2. treasure, love, conspiracy, murder : Drama
3. conspiracy, robbery, crash, treasure : Action
4. cars, murder, enemy, robbery : Drama
5. cars, crash, robbery, family : Action
6. family, vendetta, conspiracy, betrayal : Drama
7. betrayal, treasure, cars, enemy : Action
8. conspiracy, family, enemy, betrayal : Drama
And test documents:
1. D1 : murder, betrayal, enemy, conspiracy

2. D2 : cars, treasure, robbery, crash
Your task is to compute the most likely class for D1 and D2. You will start by building a
Naive Bayes classifier and using add-λ smoothing (with λ = 0.2). For this question, show your
work of computing the prior, conditional, and posterior probabilities. Do not write/submit code for
this question.
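For reference, the add-λ estimates used in this question take the following general form (this restates the standard formulation only; it is not the worked answer). Here V is the set of word types observed in the training documents, N is the number of training documents, and N_c is the number of documents with class c:

```latex
\[
P(c) = \frac{N_c}{N}, \qquad
P(w \mid c) = \frac{\operatorname{count}(w, c) + \lambda}{\sum_{w' \in V} \operatorname{count}(w', c) + \lambda\,|V|},
\]
\[
\hat{c} = \arg\max_{c}\; P(c)\prod_{i} P(w_i \mid c), \qquad \lambda = 0.2 .
\]
```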

2 Word Sense Disambiguation and Feature Selection [25 points]
2.1 Feature Construction [5 points]

Given the following sentences:
1. The company board1 of directors has seven members.
2. Several loose board2 creaked as I walked on them.
3. We all board3 the plane for Oslo tomorrow.
4. John nailed the board2 over the window.
5. Bella wanted to board3 the bus to Chicago.
6. They needed to get more senators on board4 for the bill to pass.

The sentences above show four different senses in which the word “board” can be
used, marked as 1, 2, 3, and 4. As discussed in class, collocational features, i.e. features at specific
positions near the target word, can be used to train a supervised learning model, such as Naive Bayes,
to perform word sense disambiguation. Considering the sentences above as the known
corpus, answer the question below.

Find the collocational features from a window of two words to the left and right of the word
“board”. Present the features as the words and their respective parts of speech
for each sentence. The format can be: [wi−2, POSi−2, wi−1, POSi−1, wi+1, POSi+1, wi+2, POSi+2],
where i is the index of the word “board” in a given sentence.
No need to write code, show the answer directly.
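Although the answer should be produced by hand, a small script along the following lines can be used to sanity-check the ±2 window features. This is only a sketch: it assumes NLTK's default tokenizer and tagger, whose POS tags may differ slightly from a hand analysis.

```python
import nltk

# One-time downloads for the tokenizer and POS tagger models.
nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

sentences = [
    "The company board of directors has seven members.",
    "Several loose board creaked as I walked on them.",
    "We all board the plane for Oslo tomorrow.",
    "John nailed the board over the window.",
    "Bella wanted to board the bus to Chicago.",
    "They needed to get more senators on board for the bill to pass.",
]

for sent in sentences:
    tagged = nltk.pos_tag(nltk.word_tokenize(sent))
    i = [w.lower() for w, _ in tagged].index("board")
    window = tagged[max(0, i - 2):i] + tagged[i + 1:i + 3]
    # Flatten into [w_{i-2}, POS_{i-2}, ..., w_{i+2}, POS_{i+2}].
    features = [item for pair in window for item in pair]
    print(features)
```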

2.2 Selecting Salient Features [20 points]

In class you saw that there are two kinds of features: i) collocational and ii) bag of words. Having seen how to extract collocational features (by hand) to disambiguate the sense of a word with
multiple senses in the previous problem, let’s now try to understand which features are important
for disambiguating word senses on a bigger corpus. You can use a combination of collocational
and bag-of-words features to conduct feature selection.

Dataset: We will use the English Lexical Sample task from Senseval for this problem. The
data files for this project are available here: https://bit.ly/2kKEgwx
It consists of i) a corpus file (wsd data.xml) and ii) a dictionary (dict.xml) that describes commonly
used senses for each word. Both these files are in XML format. Every lexical item in the dictionary
file contains multiple sense items, and each instance in the training data is annotated with the
correct sense of the target word for a given context.

The file wsd data.xml contains several <lexelt> tags, one corresponding to each target word in the
corpus. Each <lexelt> tag has an attribute item, whose value is “word.pos”, where “word” is the
target word and “pos” represents the part of speech of the target word. Here ‘n’, ‘v’, and ‘a’ stand
for noun, verb, and adjective, respectively.

Each <lexelt> tag has several <instance> tags, each of which corresponds to an instance of the word
that corresponds to the parent <lexelt> tag. Each <instance> tag also has an id attribute and contains
one or more <answer> tags and a <context> tag. Every <answer> tag has two attributes, instance and
senseid. The senseid attribute identifies the correct sense from the dictionary for the present
word in the current context.

A special senseid value “U” is used to indicate that the correct sense is unclear.
You can discard such instances from your feature extraction process for this assignment (we keep
these cases so that you can take a look and think about how they could also be utilized in real-world applications).

A <context> tag contains:

<context> prev-context <head> target-word </head> next-context </context>

1. prev-context is the actual text given before the target word.
2. head is the actual appearance of the target word. Note that it may be a morphological variant of the target word. For example, the word “begin.v” could show up as “beginning” instead of “begin” (lemma).
3. next-context is the actual text that follows the target word.

The dictionary file simply contains a gloss field for every sense item to indicate the corresponding definition. Each gloss consists of commonly used definitions delimited by semicolons, and may also have multiple real examples, wrapped in quotation marks and likewise delimited by semicolons.
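A minimal parsing sketch, assuming the tag and attribute names described above (<lexelt item="…">, <instance id="…">, <answer senseid="…"/>, <context>/<head>) and that the file parses cleanly with the standard library:

```python
import xml.etree.ElementTree as ET

def load_instances(path="wsd_data.xml"):
    """Yield (lexical item, instance id, senseid, prev-context, head form, next-context)."""
    root = ET.parse(path).getroot()
    for lexelt in root.iter("lexelt"):
        item = lexelt.get("item")                       # e.g. "begin.v"
        for inst in lexelt.iter("instance"):
            answers = [a.get("senseid") for a in inst.iter("answer")]
            # Skip instances whose correct sense is marked as unclear ("U").
            if not answers or "U" in answers:
                continue
            context = inst.find("context")
            head = context.find("head")
            prev_text = context.text or ""              # text before <head>
            head_form = (head.text or "").strip()       # surface form of the target word
            next_text = head.tail or ""                 # text after </head>
            yield item, inst.get("id"), answers[0], prev_text, head_form, next_text
```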

2.2.1 Feature extraction [10 points]

Your first task is to extract features from the aforementioned corpus. (1) Start with bag-of-words features and collocational features (define your own window size; see Hints below). (2) Design a new type of feature. Submit the code and output for both.
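A sketch of the two baseline feature types, assuming the contexts have already been loaded (e.g. with the parsing sketch above) and tokenized. The window sizes and the underscore-joined n-gram naming follow the hints below; all variable names here are illustrative.

```python
from collections import Counter

def bag_of_words(prev_tokens, next_tokens, window=10):
    """Counts of words within `window` tokens on each side of the head word."""
    return Counter(prev_tokens[-window:] + next_tokens[:window])

def collocations(prev_tokens, head, next_tokens):
    """Bigrams and trigrams that include the head word, joined with underscores."""
    tokens = prev_tokens[-2:] + [head] + next_tokens[:2]
    i = len(prev_tokens[-2:])            # position of the head word
    feats = Counter()
    for n in (2, 3):
        for start in range(i - n + 1, i + 1):
            gram = tokens[start:start + n]
            if start >= 0 and len(gram) == n:
                feats["_".join(gram)] += 1
    return feats

# Example with a toy context:
prev, head, nxt = ["we", "all"], "board", ["the", "plane"]
features = bag_of_words(prev, nxt) + collocations(prev, head, nxt)
print(dict(features))
```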

2.2.2 Feature selection [10 points]

Now, with the extracted features, perform feature selection to list the top 10 features that would be most important for disambiguating a word sense. (1) Design your own feature selection algorithm and explain the intuition. (2) List the top 10 features in your answer and also provide your code for this task. Submit the code and output for both.
You can use the following resources to read about ways to perform feature selection:
https://scikit-learn.org/stable/modules/feature_selection.html
https://www.datacamp.com/community/tutorials/feature-selection-python
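One possible (not prescribed) selection criterion is a chi-squared score between each feature and the sense labels. The sketch below assumes the per-instance feature dicts and sense labels from the previous step, and scikit-learn ≥ 1.0 (for get_feature_names_out); see the first link above.

```python
import numpy as np
from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_selection import SelectKBest, chi2

# feature_dicts: list of {feature_name: count} per instance (see Hints 1 and 2 below)
# senses: list of sense ids, one per instance
def top_k_features(feature_dicts, senses, k=10):
    vec = DictVectorizer(sparse=True)
    X = vec.fit_transform(feature_dicts)          # each row is one instance's feature vector
    y = np.asarray(senses)
    selector = SelectKBest(chi2, k=k).fit(X, y)
    names = np.asarray(vec.get_feature_names_out())
    order = np.argsort(selector.scores_)[::-1][:k]
    return list(zip(names[order], selector.scores_[order]))
```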


Hints:

1. You may want to represent each instance as a feature vector. Recall, a feature vector is a numerical representation of a particular data instance. Each feature vector will have a respective target, a numerical representation of the sense that the instance was labeled with.

2. In order to work with the scikit-learn API, you will have to format all of your feature vectors into a numpy array, where each row of the array is an individual feature vector. You will have to do the same for your targets – put them into a 1-dimensional numpy array.

3. As seen in class, bag-of-words features are the counts of each word within some window size of the head word. For this problem, you can set the window size large enough to include all of the tokens in the context. You may want to keep punctuation, numbers, etc., as they can be useful.

4. As also seen in the previous problem, collocational features are the n-grams that include the head word, and their counts. For this problem, you can just extract the bigrams and trigrams that include the head word. If there are multiple head words in a single context, you should extract bigrams and trigrams for each one. You can represent these n-grams as a single “word” by joining the tokens in the n-gram together using underscores to form features.

3 Language Modeling [30 points]

Your task is to train trigram word-level language models from the following English training data and then test the models.

Training data: https://www.dropbox.com/s/puo7jygnh9m52ze/train.zip?dl=0

For all questions, please submit both the code and output.

Data preprocessing instructions (a minimal preprocessing sketch follows this list):
1. Remove blank lines from each file.
2. Replace newline characters.
3. Remove duplicate spaces.
4. Replace words in the training set that appear ≤ 3 times with “UNK”.
5. Do not remove punctuation.
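A minimal sketch of the preprocessing steps above, assuming the training files are plain-text files under a local train/ directory; the directory name and the whitespace tokenization here are illustrative, not required (the note in 3.1 allows any tokenizer).

```python
import re
from collections import Counter
from pathlib import Path

def preprocess_file(path):
    text = Path(path).read_text(encoding="utf-8")
    lines = [line for line in text.splitlines() if line.strip()]   # 1. drop blank lines
    text = " ".join(lines)                                         # 2. replace newlines
    text = re.sub(r"\s+", " ", text).strip()                       # 3. collapse duplicate spaces
    return text.split()                                            # 5. punctuation is kept as-is

train_tokens = [preprocess_file(p) for p in sorted(Path("train").glob("*"))]

# 4. Replace rare words (frequency <= 3 across the training set) with "UNK".
counts = Counter(tok for toks in train_tokens for tok in toks)
train_tokens = [[t if counts[t] > 3 else "UNK" for t in toks] for toks in train_tokens]
```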


3.1 [5 points]

Across all files in the directory (counted together), report the unigram, bigram, and trigram word-level counts. Submit these counts in a file named ngramCounts.txt.
Note: You can use any word tokenizer to tokenize the dataset, e.g. nltk's word_tokenize, although for creating the n-grams do not use any libraries.
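A sketch of the n-gram counting, assuming the per-file token lists produced above; sentence-boundary padding is a design choice left to you and is omitted here.

```python
from collections import Counter

def ngram_counts(token_lists, n):
    """Count n-grams across all files, treating the files as one collection."""
    counts = Counter()
    for tokens in token_lists:
        counts.update(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return counts

unigrams = ngram_counts(train_tokens, 1)
bigrams = ngram_counts(train_tokens, 2)
trigrams = ngram_counts(train_tokens, 3)

with open("ngramCounts.txt", "w", encoding="utf-8") as out:
    for name, counts in [("unigram", unigrams), ("bigram", bigrams), ("trigram", trigrams)]:
        for gram, c in counts.most_common():
            out.write(f"{name}\t{' '.join(gram)}\t{c}\n")
```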

3.2 [10 points]

For the given test dataset:
https://www.dropbox.com/s/ik98szmqbsq2wtd/test.zip?dl=0
calculate the perplexity for each file in the test set using the linear interpolation smoothing method.

For determining the λs for linear interpolation, you can divide the training data into a new training set (80%) and a held-out set (20%), then use a grid search (a combined sketch for 3.2 and 3.3 follows Section 3.3).
1. First, report all the candidate λs used for the grid search and the corresponding perplexities you got on the held-out set.
2. Report the best λs chosen from the grid search, and explain why they were chosen (i.e. leveraging the perplexities achieved on the held-out set).
3. Report the perplexity for each file in the test set (use the best λs obtained from the grid search to calculate perplexity on the test set).

4. Based on the test files' perplexities, write a brief observation comparing the test files.
Submit these perplexities and your report in a file named perplexitiesInterpolation.txt.

3.3 [10 points]

Build another language model with add-λ smoothing. Use λ = 0.1 and λ = 0.3.
1. Report the perplexity for each file in the test set (for both λ values).
2. Based on the test files' perplexities, write a brief observation comparing the test files.
Submit these perplexities and your report in a file named perplexitiesAddLambda.txt.
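A minimal sketch of the two smoothed trigram probabilities and the perplexity computation used in 3.2 and 3.3. It assumes the count dictionaries from the 3.1 sketch, that out-of-vocabulary test words are mapped to "UNK", and that the λ grid shown is purely illustrative.

```python
import math

V = len(unigrams)                       # vocabulary size from the 3.1 counts
N = sum(unigrams.values())              # total training tokens

def interp_prob(w1, w2, w3, lambdas):
    """Linear interpolation of trigram, bigram, and unigram estimates."""
    l1, l2, l3 = lambdas                # weights for trigram, bigram, unigram
    tri = trigrams[(w1, w2, w3)] / bigrams[(w1, w2)] if bigrams[(w1, w2)] else 0.0
    bi = bigrams[(w2, w3)] / unigrams[(w2,)] if unigrams[(w2,)] else 0.0
    uni = unigrams[(w3,)] / N
    return l1 * tri + l2 * bi + l3 * uni

def addlambda_prob(w1, w2, w3, lam):
    """Add-λ smoothed trigram estimate."""
    return (trigrams[(w1, w2, w3)] + lam) / (bigrams[(w1, w2)] + lam * V)

def perplexity(tokens, prob):
    """Perplexity of one preprocessed test file under a trigram probability function."""
    logp = sum(math.log(prob(tokens[i - 2], tokens[i - 1], tokens[i]))
               for i in range(2, len(tokens)))
    return math.exp(-logp / (len(tokens) - 2))

# Illustrative candidate grid over interpolation weights, to be scored on the held-out set:
candidates = [(l1, l2, 1 - l1 - l2)
              for l1 in (0.5, 0.6, 0.7) for l2 in (0.1, 0.2, 0.3) if l1 + l2 < 1]
```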

3.4 [5 points]

Based on your observations from the above questions, compare linear interpolation and add-λ smoothing by listing their pros and cons.

4 POS Tagging with HMM and Sentence Generation [35 points]

The training dataset is a subset of the Brown corpus, where each file contains sentences in the form of tokenized words followed by POS tags. Each line contains one sentence. The training dataset can be downloaded from here: https://bit.ly/2kJI0yc The test dataset (which is another subset of the Brown corpus, containing tokenized words but no tags) can be downloaded from here: https://bit.ly/2lMybzP Information regarding the categories of the dataset can be found at: https://bit.ly/2mhF6RT.
Your task is to implement a part-of-speech tagger using a bigram HMM.

Given an observation sequence of n words w_1, …, w_n, choose the most probable sequence of POS tags t_1, …, t_n. For the questions below, please submit both code and output.

[Note: During training, for a word to be counted as unknown, its frequency in the training set should not exceed a threshold (e.g. 5). You can pick a threshold based on your algorithm design. Also, you can implement a smoothing technique of your own choice, e.g. add-α.]

4.1 [5 points]

Obtain frequency counts from the collection of all the training files (counted together). You will need the following types of frequency counts: word-tag counts, tag unigram counts, and tag bigram counts. Let's denote these by C(wi, ti), C(ti), and C(ti−1, ti), respectively. Report these quantities in different output files.

4.2 [2 points]

A transition probability is the probability of a tag given its previous tag. Calculate the transition probabilities of the training set using the following equation:

P(ti | ti−1) = C(ti−1, ti) / C(ti−1)
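A sketch of the 4.1 counts and the 4.2 transition probabilities. It assumes each training line has the form word/tag word/tag …, a local brown_train/ directory, and a "<s>" start-of-sentence tag; these names are assumptions of the sketch, and the exact corpus formatting may differ slightly.

```python
from collections import Counter
from pathlib import Path

word_tag = Counter()     # C(w_i, t_i)
tag_uni = Counter()      # C(t_i)
tag_bi = Counter()       # C(t_{i-1}, t_i)

for path in sorted(Path("brown_train").glob("*")):            # assumed directory name
    for line in path.read_text(encoding="utf-8").splitlines():
        pairs = [tok.rsplit("/", 1) for tok in line.split() if "/" in tok]
        if not pairs:
            continue
        tags = ["<s>"] + [t for _, t in pairs]                 # assumed start-of-sentence tag
        for w, t in pairs:
            word_tag[(w, t)] += 1
            tag_uni[t] += 1
        for prev, cur in zip(tags, tags[1:]):
            tag_bi[(prev, cur)] += 1
        tag_uni["<s>"] += 1

# Transition probabilities: P(t_i | t_{i-1}) = C(t_{i-1}, t_i) / C(t_{i-1})
transition = {(prev, cur): c / tag_uni[prev] for (prev, cur), c in tag_bi.items()}
```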

4.3 [3 points]

An emission probability is the probability of a given word being associated with a given tag. Calculate the emission probabilities of the training set using the following equation:

P(wi | ti) = C(wi, ti) / C(ti)

4.4 [10 points]

Generate 5 random sentences using the previously learned HMM. Output each sentence (with the POS tags) and its probability of being generated.

Hint: with the help of the emission probabilities and transition probabilities collected from 4.2 and 4.3 (a generation sketch follows this list),
1. Start with the start-of-sentence tag.
2. Choose the next tag by random choice, weighted by the transition probabilities, e.g. tag_draw = random.choices(candidate_tags, weights=transition_probabilities, k=1).
3. Now choose a word for that tag using the emission probabilities (all the words that can be generated from that tag and the corresponding probabilities they are generated with), e.g. word_draw = random.choices(candidate_words, weights=emission_probabilities, k=1).
4. Keep repeating steps 2 and 3 until you hit the end token ‘.’.
5. Report the sentence and the probability with which this sentence can be generated.
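A minimal generation sketch following the hint above. It assumes the transition dict from the 4.1/4.2 sketch, an analogous emission dict keyed by (word, tag), and a "<s>" start tag; these are assumptions of the sketch, not requirements of the assignment.

```python
import random

def by_condition(table):
    """Turn {(cond, outcome): p} into {cond: (outcomes, probs)} for sampling."""
    out = {}
    for (cond, outcome), p in table.items():
        outcomes, probs = out.setdefault(cond, ([], []))
        outcomes.append(outcome)
        probs.append(p)
    return out

# transition[(prev_tag, tag)] = P(tag | prev_tag); emission[(word, tag)] = P(word | tag)
trans_by_tag = by_condition(transition)
emit_by_tag = by_condition({(t, w): p for (w, t), p in emission.items()})

def generate_sentence(start_tag="<s>", end_word="."):
    prev, pairs, prob = start_tag, [], 1.0
    while True:
        tags, tps = trans_by_tag[prev]
        tag = random.choices(tags, weights=tps, k=1)[0]        # step 2: draw the next tag
        words, wps = emit_by_tag[tag]
        word = random.choices(words, weights=wps, k=1)[0]      # step 3: draw a word for that tag
        prob *= tps[tags.index(tag)] * wps[words.index(word)]
        pairs.append(f"{word}/{tag}")
        if word == end_word:                                   # step 4: stop at the end token
            return " ".join(pairs), prob
        prev = tag

for _ in range(5):
    sentence, p = generate_sentence()
    print(sentence, p)
```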

4.5 [15 points]

For each sentence in the test dataset, derive the most probable POS tag sequence using the Viterbi algorithm; pseudocode can be found in the textbook (http://web.stanford.edu/~jurafsky/slp3/ed3book.pdf, Figure 8.5). The Viterbi algorithm should be implemented following the pseudocode provided for reference.
Hint: Traversing the back-pointer data structure at the end of the algorithm provides, at each position, the best possible tag for the previous word; following these pointers from the last word back to the first recovers the full tag sequence.
Submit the output in a file exactly with the following format (where each line contains no more than one pair):

< sentenceID = 1 >
word/tag
word/tag
…
word/tag
< EOS >
< sentenceID = 2 >
word/tag
word/tag
…
word/tag
< EOS >
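A compact Viterbi sketch in log space, following the structure of the textbook pseudocode. It assumes the transition and emission dictionaries built earlier, a "<s>" start tag, a list of all candidate tags, and an unknown-word probability floor; all of these are assumptions of the sketch, and your own smoothing for unseen words should replace the floor.

```python
import math

UNK_LOGPROB = math.log(1e-6)          # assumed floor for unseen word/tag or tag/tag pairs

def log_emit(word, tag):
    p = emission.get((word, tag), 0.0)
    return math.log(p) if p > 0 else UNK_LOGPROB

def log_trans(prev, tag):
    p = transition.get((prev, tag), 0.0)
    return math.log(p) if p > 0 else UNK_LOGPROB

def viterbi(words, tags, start_tag="<s>"):
    n = len(words)
    best = [{} for _ in range(n)]      # best[i][t]: best log prob of tagging words[:i+1] ending in t
    back = [{} for _ in range(n)]      # back[i][t]: best previous tag for that cell

    for t in tags:                     # initialization from the start tag
        best[0][t] = log_trans(start_tag, t) + log_emit(words[0], t)
        back[0][t] = start_tag

    for i in range(1, n):              # recursion
        for t in tags:
            prev, score = max(
                ((p, best[i - 1][p] + log_trans(p, t)) for p in tags),
                key=lambda x: x[1],
            )
            best[i][t] = score + log_emit(words[i], t)
            back[i][t] = prev

    # Termination and back-trace through the back-pointers.
    last = max(best[n - 1], key=best[n - 1].get)
    sequence = [last]
    for i in range(n - 1, 0, -1):
        sequence.append(back[i][sequence[-1]])
    return list(reversed(sequence))
```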
