Ling 473 Assignment 1

1. (25 points) Write a paragraph describing how you became interested in Computational Linguistics,
any projects or specific areas you’re interested in, and/or career goals. How would you characterize
your experience in linguistics, math, or computer programming (or other relevant engineering)?
Recalling the lecture slides, which of the subfields or subtasks of Computational Linguistics are you
particularly interested in?
2. (25 points) Consider the following sentence:
I saw that gas can explode.
a. How many phrase structure trees can you find for this sentence? Do not include pragmatically
odd interpretations. Draw each tree and provide a discriminating explanation of the situation
modeled by the interpretation.
b. Write the phrase structure trees from the previous question using Penn Treebank notation. That
is, write it with brackets and parentheses: (S (NP (NNP Kim)) (VP (VBZ sleeps)))
3. (10 points) How many six-letter “words” can be formed from the alphabet { a – z }? A “word” for
this question must have at least one vowel { a e i o u }, and may not contain all vowels. Show your
work and explain your answer.
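One way to sanity-check a count like this is complementary counting: start from all strings and subtract the excluded cases. The sketch below assumes that "may not contain all vowels" means "is not made up entirely of vowels"; if the intended reading is "does not use all five vowels", the subtraction changes. The function names are illustrative, not from the course materials.

```python
from itertools import product


def count_words(alphabet_size: int, vowel_count: int, length: int) -> int:
    """Strings with at least one vowel that are not made up entirely of vowels."""
    total = alphabet_size ** length
    no_vowels = (alphabet_size - vowel_count) ** length  # consonants only
    only_vowels = vowel_count ** length                  # vowels only
    return total - no_vowels - only_vowels


def brute_force(alphabet: str, vowels: str, length: int) -> int:
    """Enumerate every string; only feasible for small alphabets and lengths."""
    count = 0
    for letters in product(alphabet, repeat=length):
        n_vowels = sum(ch in vowels for ch in letters)
        if 0 < n_vowels < length:
            count += 1
    return count


# Check the formula against enumeration on a small case, then apply it.
small_check = brute_force("abcde", "ae", 3)
six_letter_count = count_words(26, 5, 6)
print(small_check, count_words(5, 2, 3))
print(six_letter_count)
```

The brute-force check on a five-letter alphabet is there only to confirm that the subtraction does not double-count: the "no vowels" and "only vowels" cases are disjoint.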
4. (10 points) How many ways can the characters in the following tuple be arranged?
( 萄萄萄萄橙橙苹梨蕉 )
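Arrangements of a sequence with repeated items can be counted by dividing the factorial of the length by the factorial of each item's multiplicity. A minimal sketch, assuming identical characters are indistinguishable (the function name is illustrative):

```python
from collections import Counter
from math import factorial


def multiset_permutations(items: str) -> int:
    """Distinct orderings of a sequence in which equal items are interchangeable."""
    count = factorial(len(items))
    for multiplicity in Counter(items).values():
        count //= factorial(multiplicity)
    return count


arrangements = multiset_permutations("萄萄萄萄橙橙苹梨蕉")
print(arrangements)
```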
5. (30 points) Consider a document processing system which performs pairwise comparisons and a
corpus containing 19 documents as follows:
Topic                     Count
Conference Proceedings    7
Journal Articles          9
Workshop Abstracts        3
a. How many pairwise comparisons are possible between documents on the same topic?
b. How many pairwise comparisons are possible between documents on different topics?
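The two counts can be cross-checked with binomial coefficients. This sketch assumes a comparison is an unordered pair of two different documents, which the problem does not state explicitly.

```python
from math import comb

topic_counts = {
    "Conference Proceedings": 7,
    "Journal Articles": 9,
    "Workshop Abstracts": 3,
}

# Pairs drawn from within a single topic, summed over topics.
same_topic = sum(comb(n, 2) for n in topic_counts.values())

# Every unordered pair in the corpus; the remainder must cross topics.
all_pairs = comb(sum(topic_counts.values()), 2)
different_topic = all_pairs - same_topic

print(same_topic, different_topic, all_pairs)
```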
** (10 points, extra credit) In the lecture, we showed that you can form
$$\frac{n!}{(n-k)!\,k!}$$
different unordered sets of k distinct items from a set of n distinct items. Write an expression that
gives the number of unordered sets of k items that can be formed from a set of n distinct items while
allowing repetition in the output set.
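A candidate expression can be tested against direct enumeration for small n and k. The sketch below only counts; it relies on `itertools.combinations_with_replacement` from the Python standard library, and the function name is illustrative.

```python
from itertools import combinations_with_replacement


def count_multisets(n: int, k: int) -> int:
    """Count unordered selections of k items from n distinct items, repetition allowed."""
    return sum(1 for _ in combinations_with_replacement(range(n), k))


# Tabulate a few small cases to compare against a proposed closed form.
for n in range(1, 5):
    print(n, [count_multisets(n, k) for k in range(0, 5)])
```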

Ling 473 Assignment 2

1. (60 points) Using the following sets, we run a trial which selects exactly one word from each set.
Within each set, all words are equally likely.
A = { monkey, donkey, yak, kangaroo, aardvark, antelope, puma, cheetah }
B = { whale, shark, dolphin, eel }
Let E be the event that either of the words contains a ‘y’
Let F be the event that both words contain an ‘e’
Let G be the event that both words contain the same number of letters
Let H be the event that either of the words contains more than two vowels { a e i o u }. This count
includes repeated uses of the same vowel.
a. Give values for the following:
P(E)
P(F)
P(G)
P(H)
P(E ⋃ H)
P(F ∩ H)
P(E ∩ F ∩ G)
P(H ⋃ G)
P(H ∩ F^C)
b. Place a letter ‘X’ in the table corresponding to the events that are mutually exclusive. Place a letter
‘I’ in the table corresponding to the events that are independent.
      E     F     G
H   [ ]   [ ]   [ ]
G   [ ]   [ ]
F   [ ]
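Because the sample space has only 8 × 4 = 32 equally likely outcomes, every probability in this problem can be checked by enumeration. The sketch below assumes "either" means an inclusive or; the helper names (`prob`, `joint`, `mutually_exclusive`, `independent`) are illustrative.

```python
from fractions import Fraction
from itertools import product

A = ["monkey", "donkey", "yak", "kangaroo", "aardvark", "antelope", "puma", "cheetah"]
B = ["whale", "shark", "dolphin", "eel"]
VOWELS = set("aeiou")


def vowel_count(word: str) -> int:
    return sum(ch in VOWELS for ch in word)


# Each event is a predicate over one outcome (a, b). "Either" is read as inclusive or.
events = {
    "E": lambda a, b: "y" in a or "y" in b,
    "F": lambda a, b: "e" in a and "e" in b,
    "G": lambda a, b: len(a) == len(b),
    "H": lambda a, b: vowel_count(a) > 2 or vowel_count(b) > 2,
}

outcomes = list(product(A, B))


def prob(pred) -> Fraction:
    hits = sum(1 for a, b in outcomes if pred(a, b))
    return Fraction(hits, len(outcomes))


def joint(x: str, y: str) -> Fraction:
    return prob(lambda a, b: events[x](a, b) and events[y](a, b))


def mutually_exclusive(x: str, y: str) -> bool:
    return joint(x, y) == 0


def independent(x: str, y: str) -> bool:
    return joint(x, y) == prob(events[x]) * prob(events[y])


for name, pred in events.items():
    print(name, prob(pred))
```

Unions and complements follow the same pattern: pass `prob` a predicate that combines the event predicates with `or` or `not`.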
2. (40 points) Working in Yunnan, a field linguist has discovered an extinct version of the Dongba
pictographic script. So far, his team has found 32 distinct glyphs in this script, and the linguist has
deciphered 22 of them. He just received news that another researcher has discovered a new inscription
that consists of 8 glyphs. These 8 have all previously been encountered, but he doesn’t yet know if the
new inscription has repeated glyphs, or not.
a. What is the probability that the linguist will fully understand the newly discovered inscription?
b. What is the probability that the linguist will understand at least half of the glyphs in the newly
discovered inscription?
(Extra Credit: 20 points) The linguist learns that the 8 glyphs in the new inscription are distinct
(but still in the set of 32 previously seen glyphs). Now what is the probability that the linguist
will understand at least half of them?
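The answer depends on how the inscription is modeled, which the problem leaves partly open. The sketch below makes two assumptions: when repeats are possible, each of the 8 positions is an independent, uniform draw from the 32 glyphs (a binomial model); when the glyphs are known to be distinct, the inscription is a uniformly random 8-glyph subset (a hypergeometric model). Function and variable names are illustrative.

```python
from fractions import Fraction
from math import comb

KNOWN, TOTAL, LENGTH = 22, 32, 8


def binomial_tail(n: int, p: Fraction, at_least: int) -> Fraction:
    """P(at least `at_least` successes in n independent trials with success rate p)."""
    return sum(comb(n, j) * p**j * (1 - p) ** (n - j) for j in range(at_least, n + 1))


def hypergeometric_tail(total: int, known: int, draws: int, at_least: int) -> Fraction:
    """P(at least `at_least` known items among `draws` distinct items drawn from `total`)."""
    return sum(
        Fraction(comb(known, j) * comb(total - known, draws - j), comb(total, draws))
        for j in range(at_least, draws + 1)
    )


p_known = Fraction(KNOWN, TOTAL)
p_all = p_known**LENGTH                                   # every glyph deciphered
p_half_indep = binomial_tail(LENGTH, p_known, LENGTH // 2)
p_half_distinct = hypergeometric_tail(TOTAL, KNOWN, LENGTH, LENGTH // 2)

print(float(p_all), float(p_half_indep), float(p_half_distinct))
```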

Ling 473 Assignment 3

1. (20 points) In Lecture 3, we looked at the outcomes of rolling two fair dice. For this problem, we will
consider weighted dice—one white, and one red. For each die, 1 and 6 are twice as likely to show as
the other four values.
a. What is the probability that the total showing on the two dice will be 7?
b. What is the probability that the total showing on the two dice will be 9 or higher?
c. What is the probability that the red die will show a higher number than the white one?
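With weighted dice the 36 outcomes are no longer equally likely, so each outcome has to be weighted by the product of its two face probabilities. A minimal sketch, assuming the two dice are independent (the names `weights`, `die`, and `prob` are illustrative):

```python
from fractions import Fraction
from itertools import product

# Faces 1 and 6 are twice as likely as each of the other four faces.
weights = {1: 2, 2: 1, 3: 1, 4: 1, 5: 1, 6: 2}
total_weight = sum(weights.values())
die = {face: Fraction(w, total_weight) for face, w in weights.items()}


def prob(event) -> Fraction:
    """Sum the probability of every (white, red) outcome that satisfies `event`."""
    return sum(die[w] * die[r] for w, r in product(die, repeat=2) if event(w, r))


p_seven = prob(lambda w, r: w + r == 7)
p_nine_plus = prob(lambda w, r: w + r >= 9)
p_red_higher = prob(lambda w, r: r > w)

print(p_seven, p_nine_plus, p_red_higher)
```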
2. (35 points) The following is the first paragraph of Ernest Hemingway’s The Old Man and the Sea.
It has been POS-tagged using the online Brill tagger at the Center for Sprogteknologi at Københavns
Universitet. A few minor changes have been applied.
This assignment does not require programming, but if you wish to work with an electronic version of
this information, you can refer to the following file:
/opt/dropbox/16-17/473/assignment3/old-man.txt
a. How many bigrams does the sample contain?
b. In a bigram model, we assume that a POS tag depends only on the POS tag of the preceding
word. Calculate 𝑃(. | NN), assuming that the counts in the above sample are perfectly
representative.
c. We are interested in the probability of the bigram DT JJ in the sample text. What is the value of
𝑃(DT JJ)?
d. A trigram model predicates a POS tag on the POS tags of the preceding bigram. Calculate
𝑃(NN | DT JJ) for the sample.
e. Assume this sample characterizes a larger corpus. Assume that measured probabilities are
independent. Estimate 𝑃(DT JJ | NN) for the corpus. (Hint: this will use Bayes’ Theorem.)
Show your work.
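The counting behind parts a through d can be sketched on a toy tag sequence. The tags below are made up for illustration and are not the tagged Hemingway paragraph; the format of old-man.txt is not shown here, so reading that file is left out. Conditional probabilities are estimated as a ratio of n-gram counts.

```python
from collections import Counter
from fractions import Fraction


def ngram_counts(tags: list, n: int) -> Counter:
    """Count every run of n consecutive tags."""
    return Counter(tuple(tags[i:i + n]) for i in range(len(tags) - n + 1))


# Hypothetical tag sequence, for illustration only.
toy_tags = ["DT", "JJ", "NN", ".", "DT", "NN", "VBD", "DT", "JJ", "NN", "."]

unigrams = ngram_counts(toy_tags, 1)
bigrams = ngram_counts(toy_tags, 2)
trigrams = ngram_counts(toy_tags, 3)

num_bigrams = sum(bigrams.values())  # a sequence of N tags has N - 1 bigrams

# P(. | NN): how often NN is followed by "."
p_stop_given_nn = Fraction(bigrams[("NN", ".")], unigrams[("NN",)])

# P(DT JJ): share of all bigrams that are DT JJ
p_dt_jj = Fraction(bigrams[("DT", "JJ")], num_bigrams)

# P(NN | DT JJ): how often the bigram DT JJ is followed by NN
p_nn_given_dt_jj = Fraction(trigrams[("DT", "JJ", "NN")], bigrams[("DT", "JJ")])

print(num_bigrams, p_stop_given_nn, p_dt_jj, p_nn_given_dt_jj)
```

One convention to watch: dividing by the unigram count of the conditioning tag slightly underestimates the conditional probability when that tag is also the last token of the sample, because the final token starts no bigram.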
3. (15 points) For phonetic elicitation with a group of American test subjects, we are using three word
lists:
A = { gnat, beet }
B = { loon, fee }
C = { peel, pool, he, sand }
The test protocol is as follows: One of the lists is selected at random. Then, the subject is asked to
pronounce a randomly selected word from that list. What is the probability that the word will have a
high/close vowel (as opposed to low/open)? If you are not familiar with vowel phonetics, you can check
the Lecture 5 recording, or listen to samples on http://en.wikipedia.org/wiki/Vowel.
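This is an application of the law of total probability: weight the chance of a high vowel within each list by the chance of selecting that list. The vowel-height labels below are an assumption based on typical General American pronunciations and should be checked against the lecture material.

```python
from fractions import Fraction

# Assumed labels: True = high/close vowel, False = low/open vowel.
word_lists = {
    "A": {"gnat": False, "beet": True},
    "B": {"loon": True, "fee": True},
    "C": {"peel": True, "pool": True, "he": True, "sand": False},
}

p_list = Fraction(1, len(word_lists))  # each list is equally likely to be chosen

# P(high) = sum over lists of P(list) * P(high | list)
p_high = sum(
    p_list * Fraction(sum(words.values()), len(words))
    for words in word_lists.values()
)

print(p_high)
```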
4. (30 points) A classifier has partitioned a set of eight biomedical documents into
𝐶 = { mentions the IL-2R α-promoter } (6 documents), and
𝐶̅ (the rest).
The gold standard indicates that only three documents actually mention the Interleukin-2 receptor alpha
promoter, and we determine that exactly one of them is (incorrectly) in 𝐶̅. In testing a post-processing
heuristic, we select a document at random from 𝐶 and move it into the class 𝐶̅. Next, we randomly select
a document from 𝐶̅.
a. What is the probability that the document we selected from 𝐶̅ mentions the IL-2R α-promoter
(according to the gold standard)?
b. Next, we note that the document we selected from 𝐶̅ does, in fact (according to the gold
standard), mention the IL-2R α-promoter. Given this additional information, what is the
probability that the document that we transferred from 𝐶 to 𝐶̅ mentioned (according to the
gold standard) the IL-2R α-promoter (i.e., that we moved it to the wrong class)?
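The two-stage experiment is small enough to enumerate, which gives a check on a Bayes' Theorem derivation. The sketch below assumes both selections are uniformly random and that the second selection is made from the class after the transfer, so it holds three documents. Variable names are illustrative.

```python
from fractions import Fraction

# True = the document mentions the promoter according to the gold standard.
# Three documents mention it; exactly one of those sits in the complement class.
C = [True, True, False, False, False, False]
C_bar = [True, False]

p_selected = Fraction(0)            # P(selected document mentions the promoter)
p_moved_and_selected = Fraction(0)  # P(moved mentions it AND selected mentions it)

for moved in C:                      # stage 1: move one document out of C
    new_c_bar = C_bar + [moved]
    for selected in new_c_bar:       # stage 2: pick one document from the enlarged class
        p = Fraction(1, len(C)) * Fraction(1, len(new_c_bar))
        if selected:
            p_selected += p
            if moved:
                p_moved_and_selected += p

# Conditional probability by definition: P(moved | selected) = P(both) / P(selected)
p_moved_given_selected = p_moved_and_selected / p_selected

print(p_selected, p_moved_given_selected)
```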
