ECE 367 MATRIX ALGEBRA AND OPTIMIZATION Problem Set #1

THEORY PROBLEMS
Problem 1.1 (Showing ℓ1 and ℓ∞ are both norms)
Show
(a) that the function ℓ1 : R^n → R is a norm, and
(b) that the function ℓ∞ : R^n → R is a norm,
by verifying that they satisfy all the properties of a norm, cf. OptM Definition 2.1.
Problem 1.2 (Norm inequalities)
OptM Prob. 2.6.
Note: This problem shows that each norm (ℓ1, ℓ2, ℓ∞) is both upper- and lower-bounded by each of
the other norms to within a constant (dimension-dependent) factor.
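As an informal numerical sanity check (not part of the required written solution), the Python
sketch below verifies the standard chain of bounds ℓ∞(x) ≤ ℓ2(x) ≤ ℓ1(x) ≤ √n · ℓ2(x) ≤ n · ℓ∞(x)
on random vectors; the dimension n = 50 and the use of NumPy are arbitrary illustrative choices.

import numpy as np

rng = np.random.default_rng(0)
n = 50
for _ in range(1000):
    x = rng.standard_normal(n)
    l1 = np.abs(x).sum()        # ℓ1 norm
    l2 = np.linalg.norm(x)      # ℓ2 norm
    linf = np.abs(x).max()      # ℓ∞ norm
    # Each norm bounds the others to within a dimension-dependent constant factor.
    assert linf <= l2 <= l1 <= np.sqrt(n) * l2 <= n * linf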
Problem 1.3 (Linear independence of stacked vectors)
IALA Prob. 5.1.
Problem 1.4 (Orthogonality)
OptM Prob. 2.5.
Problem 1.5 (Inner products)
OptM Prob. 2.4.
Problem 1.6 (Distance versus angle nearest neighbors)
IALA Problem 3.24.

APPLICATION PROBLEMS
Problem 1.7 (Angles between word vectors)
In this problem you investigate how geometric concepts such as distance and angle can be applied to quantify similarity between text documents. Download the files wordVecArticles.txt,
wordVecTitles.txt, wordVecWords.txt and wordVecV.mat from the course website. The first
two files each have ten lines. Each line in the first file consists of the text of one Wikipedia article.
The corresponding line of the second file is the title of the article. The last two files are described
in detail below.
Denote by D the set of documents where the number of documents is |D|. (In our dataset |D| = 10).
Let W denote the union of words in all articles, i.e., the lexicon of the set of documents. We denote
the cardinality of W by |W|. Assume the lexicon is ordered “lexicographically” (e.g., alphabetically)
so that there is a one-to-one mapping from each word w ∈ W to an element of the index set t ∈ [|W|].
Let fterm(t, d) denote the number of times the word w ∈ W that is indexed as t ∈ [|W|] appears in
the dth article, where d ∈ [|D|]. Note that Σ_{t=1}^{|W|} fterm(t, d) is the number of words (the
length) of the dth article. We refer to fterm(t, d) as the term frequency (really “term count”).
For the first few parts of this problem you will be using a pre-processed W set and pre-computed
fterm(t, d) values. The pre-processed data appears in the files wordVecWords.txt and wordVecV.mat.
The first file represents the set W where elements of W are listed line by line, for 1651 lines, i.e.,
|W| = 1651. You can load the content of the second file into MATLAB using the command
load 'wordVecV.mat'. After loading, you will see a matrix V of dimensions 1651 × 10. The value
in the tth row and dth column of this matrix is fterm(t, d). Use the provided data in V to answer
parts (a) to (d) of this problem.
(a) Let the |W|-dimensional vectors vd, d ∈ [|D|] be defined as
vd = (fterm(1, d), fterm(2, d), . . . , fterm(|W|, d)). Using vd to represent the dth document,
which two articles are closest in Euclidean distance (smallest distance)? Which two are
closest in angle distance (smallest angle)? Are they the same pair? If not, what could be a
reason for them being different? You may find the pdist command in MATLAB useful for
computing pairwise Euclidean and angle distances of vectors.
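Although the problem suggests MATLAB's pdist, a minimal Python sketch of the same computation is
given below. It assumes the matrix in wordVecV.mat is stored under the variable name V (as described
above) and that NumPy and SciPy are available; the library choice is an assumption, not part of the
assignment.

import numpy as np
from scipy.io import loadmat
from scipy.spatial.distance import pdist, squareform

# Load the pre-computed term-frequency matrix V (|W| x |D| = 1651 x 10).
V = loadmat('wordVecV.mat')['V'].astype(float)
docs = V.T  # one row per document vector v_d

# Pairwise Euclidean distances between document vectors.
euclid = squareform(pdist(docs, metric='euclidean'))

# Pairwise angles: arccos of cosine similarity (pdist's 'cosine' returns 1 - cosine).
angles = np.arccos(np.clip(1.0 - squareform(pdist(docs, metric='cosine')), -1.0, 1.0))

# Find the closest pair under each measure, ignoring the zero diagonal.
np.fill_diagonal(euclid, np.inf)
np.fill_diagonal(angles, np.inf)
# Indices are 0-based; add 1 to match the article numbering d ∈ [|D|].
print('closest in Euclidean distance:', np.unravel_index(np.argmin(euclid), euclid.shape))
print('closest in angle:', np.unravel_index(np.argmin(angles), angles.shape))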
(b) In this part let the |W|-dimensional normalized vectors ṽd, d ∈ [|D|] be defined as
ṽd = vd / Σ_{t=1}^{|W|} fterm(t, d), where the vd are defined as in the previous part. Using ṽd to
represent the dth document, which two articles are closest in Euclidean distance (smallest
distance)? Which two are closest in angle distance (smallest angle)? Are your answers the same as
in the previous part? What would be a reason for using this normalization?
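Continuing the sketch from part (a) under the same assumptions, this normalization simply rescales
each document vector by its total word count:

# Normalize each document vector by article length (the sum of its term counts).
docs_tilde = docs / docs.sum(axis=1, keepdims=True)
# Recompute the pairwise Euclidean distances and angles exactly as in part (a),
# now on docs_tilde in place of docs.

Note that rescaling a vector by a positive constant does not change its direction, which is worth
keeping in mind when comparing the angle-based answers here with those of part (a).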
Now, let fdoc(t) = Σ_{d=1}^{|D|} I[fterm(t, d) > 0], where I(·) is the indicator function taking
value one if the clause is true and zero else. The function fdoc(t) counts in how many documents
the tth word appears. We refer to fdoc(t) as the document frequency.

We combine the term and document frequency definitions into what is called the term frequency-
inverse document frequency score (TF-IDF), defined as

w(t, d) = ( fterm(t, d) / Σ_{t=1}^{|W|} fterm(t, d) ) · √( log( |D| / fdoc(t) ) ).
Note, the denominator of the log is never zero since, by definition, each term appears in at least
one document.
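A minimal Python sketch of this score, continuing the assumptions above (V holds the fterm(t, d)
counts with terms on the rows and documents on the columns) and reading the square-root-of-log
factor literally as written:

import numpy as np

# Term frequencies normalized by article length (column sums of V).
tf = V / V.sum(axis=0, keepdims=True)

# Document frequency f_doc(t): number of documents containing term t.
f_doc = np.count_nonzero(V > 0, axis=1)

# Inverse document frequency factor sqrt(log(|D| / f_doc(t))).
n_docs = V.shape[1]
idf = np.sqrt(np.log(n_docs / f_doc))

# TF-IDF matrix: entry (t, d) is w(t, d).
W_tfidf = tf * idf[:, None]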
(c) Now let the |W|-dimensional vectors wd, d ∈ [|D|] be defined as wd = (w(1, d), w(2, d), . . . , w(|W|, d)).
Using wd to represent the dth document, which two articles are closest in Euclidean distance
(smallest distance)?
(d) What might be a reason for using the “inverse document frequency” adjustment? What is
the adjustment doing geometrically?
OPTIONAL PART: The following part (e) is optional.
(e) In the previous parts of the problem you used the pre-processed W set and pre-computed
fterm(t, d) values provided in wordVecV.mat and term indexes in wordVecWords.txt. In
this part of the problem you will reproduce your results from (a) to (d) without using this
pre-computed data. Specifically, start from the raw text files wordVecArticles.txt and
wordVecTitles.txt, and write code to obtain W and fterm(t, d). As always, you are welcome
to use whichever software language you wish to solve this problem. We note that Python is
particularly well suited to text processing. For example, you may find the snippet of Python
code following the problem statement useful. This snippet loads in data from the given text
file and stores it in the variable articles. It also counts the number of word occurrences in
the first article and stores the resulting (word, count) pairs in the variable wordcounts. You
could use this as a starting point in the generation of the vectors we provide in wordVecV.mat
and then calculate angles and distances in MATLAB. Alternatively, you could show how to
complete the entire processing pipeline in Python. Be sure to include a printout of your code.
from collections import Counter
# Read the articles, one per line, stripping trailing newlines.
articles = [line.rstrip('\n') for line in open('wordVecArticles.txt')]
# Count word occurrences in the first article as (word, count) pairs.
wordcounts = Counter(articles[0].split())
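One possible (unofficial) continuation of that snippet for part (e) is sketched below. The
tokenization step (lower-casing and stripping punctuation) is an assumption, since the exact
pre-processing used to produce wordVecWords.txt and wordVecV.mat is not specified, so the resulting
counts may differ slightly from the provided data.

import string
import numpy as np
from collections import Counter

def tokenize(text):
    # Assumed pre-processing: lower-case and strip punctuation before splitting on whitespace.
    table = str.maketrans('', '', string.punctuation)
    return text.lower().translate(table).split()

articles = [line.rstrip('\n') for line in open('wordVecArticles.txt')]
counts = [Counter(tokenize(article)) for article in articles]

# Lexicon W: the union of words over all articles, ordered lexicographically.
lexicon = sorted(set().union(*counts))
index = {word: t for t, word in enumerate(lexicon)}

# Term-count matrix: rows indexed by the words of W, columns by documents.
V = np.zeros((len(lexicon), len(articles)))
for d, c in enumerate(counts):
    for word, n in c.items():
        V[index[word], d] = n

From here the distance, angle, and TF-IDF computations of parts (a) to (d) can be repeated on this V.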