CSE 891-001 Deep Learning: Programming Homework 3

Introduction:
In this assignment, you will train two attention-based neural machine translation (NMT) models to translate
words from English to Pig-Latin. Along the way, you will gain experience with several important concepts
in NMT, including gated recurrent neural networks and attention.
Pig Latin: It is a simple transformation of English based on the following rules applied on a per-word basis:
1. If the first letter of a word is a consonant, then the letter is moved to the end of the word, and the
letters “ay” are added to the end. For instance, team → eamtay.
2. If the first letter is a vowel, then the word is left unchanged and the letters “way” are added to the
end: impress → impressway.
3. In addition, some consonant pairs, such as “sh”, are treated as a block and are moved to the end of
the string together: shopping → oppingshay.
To translate a whole sentence from English to Pig-Latin, we simply apply these rules to each word independently: i went shopping → iway entway oppingshay
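For concreteness, here is a minimal rule-based sketch of the transformation described above; the function names and the list of consonant blocks are our own assumptions and are not part of the assignment code.

# Hypothetical rule-based reference for the Pig-Latin rules above (not part of
# the assignment code). The consonant-block list is an assumption, not exhaustive.
VOWELS = set("aeiou")
CONSONANT_BLOCKS = ("sh", "ch", "th", "wh", "ph")

def pig_latin_word(word: str) -> str:
    if word[0] in VOWELS:
        return word + "way"                          # rule 2: starts with a vowel
    for block in CONSONANT_BLOCKS:
        if word.startswith(block):
            return word[len(block):] + block + "ay"  # rule 3: consonant block moved as a unit
    return word[1:] + word[0] + "ay"                 # rule 1: single leading consonant

def pig_latin_sentence(sentence: str) -> str:
    # Apply the per-word rules independently to each word in the sentence.
    return " ".join(pig_latin_word(w) for w in sentence.split())

print(pig_latin_sentence("i went shopping"))  # iway entway oppingshay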
We would like to train a neural machine translation model to learn the rules of Pig-Latin implicitly, from
(English, Pig-Latin) word pairs. Since the translation to Pig Latin involves moving characters around in a
string, we will use character-level recurrent neural networks for our model. Since English and Pig-Latin
are structurally very similar, the translation task is almost a copy task; the model must remember each
character in the input, and recall the characters in a specific order to produce the output. This makes it an
ideal task for understanding the capacity of NMT models.
Data: The data for this task consists of pairs of words {(s^(i), t^(i))}, i = 1, ..., N, where the source s^(i) is an English
word and the target t^(i) is its translation in Pig-Latin. The dataset is composed of unique words from
the book “Sense and Sensibility” by Jane Austen. The vocabulary consists of 29 tokens: the 26 standard
alphabet letters (all lowercase), the dash symbol “-”, and two special tokens, start of sentence and
end of sentence. The dataset contains 6387 unique (English, Pig-Latin) pairs in total; the first few
examples are: { (the, ethay), (family, amilyfay), (of, ofway), . . . }
In order to simplify the processing of mini-batches of words, the word pairs are grouped based on the
lengths of the source and target. Thus, in each mini-batch the source words are all the same length, and the
target words are all the same length. This simplifies the code, as we don’t have to worry about batches of
variable-length sequences.
Encoder-Decoder NMT Setup: Translation is a sequence-to-sequence problem: in our case, both the input
and output are sequences of characters. A common architecture used for sequence-to-sequence problems is the encoder-decoder model [1], composed of two RNNs, as follows:
[Figure: the encoder-decoder architecture at training time. The encoder reads the input characters “c a t”; its final hidden state initializes the decoder, which is fed the ground-truth output characters and predicts “a t c a y” one character per time step.]
The encoder RNN compresses the input sequence into a fixed-length vector, represented by the final hidden
state h^enc_T. The decoder RNN conditions on this vector to produce the translation, character by character.
Input characters are passed through an embedding layer before they are fed into the encoder RNN; in
our model, we learn a 29×10 embedding matrix, where each of the 29 characters in the vocabulary is assigned a 10-dimensional embedding. At each time step, the decoder RNN outputs a vector of unnormalized
log probabilities given by a linear transformation of the decoder hidden state. When these probabilities are
normalized, they define a distribution over the vocabulary, indicating the most probable characters for that
time step. The model is trained via a cross-entropy loss between the decoder distribution and ground-truth
at each time step.
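As a rough illustration of this setup (the layer names and the hidden size below are assumptions for the sketch, not necessarily those used in models.py), a single decoder output step and its loss might look like:

import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, emb_size, hidden_size, batch_size = 29, 10, 20, 64   # assumed sizes

embedding = nn.Embedding(vocab_size, emb_size)   # the learned 29 x 10 embedding matrix
out_layer = nn.Linear(hidden_size, vocab_size)   # hidden state -> unnormalized log probabilities

x = torch.randint(0, vocab_size, (batch_size,))  # a batch of input character ids
emb = embedding(x)                               # (batch_size, 10) embeddings fed to the RNN

h_dec = torch.zeros(batch_size, hidden_size)           # stand-in for a decoder hidden state
targets = torch.randint(0, vocab_size, (batch_size,))  # ground-truth character ids at this step

logits = out_layer(h_dec)                  # unnormalized log probabilities over the vocabulary
loss = F.cross_entropy(logits, targets)    # cross-entropy against the ground truth at this step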
The decoder produces a distribution over the output vocabulary conditioned on the previous hidden
state and the output token from the previous time step. A common practice used to train NMT models is to
feed in the ground-truth token from the previous time step to condition the decoder output in the current
step, as shown in the figure above. At test time, we don’t have access to the ground-truth output
sequence, so the decoder must condition its output on the token it generated in the previous time step, as
shown below.
[Figure: the encoder-decoder architecture at test time. The encoder reads “c a t” as before, but the decoder now feeds each generated character back in as the input for the next time step while producing “a t c a y”.]
1 Gated Recurrent Unit (2pts): The forward pass of a Gated Recurrent Unit [2] is defined by the following
equations, where ⊙ denotes element-wise multiplication:
r_t = σ(W_xr x_t + W_hr h_{t−1} + b_r) (1)
z_t = σ(W_xz x_t + W_hz h_{t−1} + b_z) (2)
g_t = tanh(W_xg x_t + r_t ⊙ (W_hg h_{t−1} + b_g)) (3)
h_t = (1 − z_t) ⊙ g_t + z_t ⊙ h_{t−1} (4)
1. (1pt) Although PyTorch has a GRU built in (nn.GRUCell), we will implement our own GRU cell
from scratch to better understand how it works. Fill in the __init__ and forward methods of the
MyGRUCell class in models.py to implement the above equations (a minimal sketch is shown after
this list). A template has been provided for the forward method.
2. (1pt) Train the RNN encoder/decoder model. We have provided implementations of the recurrent encoder/decoder models using the GRU cell. By default, the script runs for 100 epochs. At the end of
each epoch, the script prints the training and validation losses, and the Pig-Latin translation of a fixed
sentence, “the air conditioning is working”, so that you can see how the model improves qualitatively over time. The script also saves several items to the directory h20-bs64-rnn:
• The best encoder and decoder model parameters, based on the validation loss.
• A plot of the training and validation losses.
After training is complete, use this model to translate words with the translate_sentence function. Try a few of your own words by changing the variable
TEST_SENTENCE. Identify two distinct failure modes and briefly describe them.
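For reference, here is a minimal sketch of a GRU cell following equations (1)–(4); the attribute names are our own, and the provided MyGRUCell template in models.py may organize its parameters differently.

import torch
import torch.nn as nn

class MyGRUCellSketch(nn.Module):
    """A sketch of a GRU cell implementing equations (1)-(4); not the official solution."""

    def __init__(self, input_size, hidden_size):
        super().__init__()
        # One bias per gate; following eq. (3), each bias is grouped with the
        # hidden-to-hidden transform.
        self.Wxr = nn.Linear(input_size, hidden_size, bias=False)
        self.Whr = nn.Linear(hidden_size, hidden_size)
        self.Wxz = nn.Linear(input_size, hidden_size, bias=False)
        self.Whz = nn.Linear(hidden_size, hidden_size)
        self.Wxg = nn.Linear(input_size, hidden_size, bias=False)
        self.Whg = nn.Linear(hidden_size, hidden_size)

    def forward(self, x, h_prev):
        r = torch.sigmoid(self.Wxr(x) + self.Whr(h_prev))    # eq. (1): reset gate
        z = torch.sigmoid(self.Wxz(x) + self.Whz(h_prev))    # eq. (2): update gate
        g = torch.tanh(self.Wxg(x) + r * self.Whg(h_prev))   # eq. (3): candidate state
        return (1 - z) * g + z * h_prev                      # eq. (4): new hidden state

The cell keeps the same (input, hidden) → hidden call interface as nn.GRUCell, so it can stand in for the built-in cell inside the encoder and decoder.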
Deliverables:
• A python file with code: models.py
• Create a report with a section called Gated Recurrent Units. Include the training/validation loss
plots.
• Make sure to add at least one input-output pair for each failure case you identify.
2 Additive Attention (3pts): Attention allows a model to look back over the input sequence, and focus
on relevant input tokens when producing the corresponding output tokens. For our simple task, attention
can help the model remember tokens from the input, e.g., focusing on the input letter c to produce the
output letter c.
The hidden states produced by the encoder while reading the input sequence, {h^enc_1, . . . , h^enc_T}, can be
viewed as annotations of the input; each encoder hidden state h^enc_i captures information about the i-th
input token, along with some contextual information. At each time step, an attention-based decoder
computes a weighting over the annotations, where the weight given to each one indicates its relevance in
determining the current output token.
In particular, at time step t, the decoder computes an attention weight α_i for each of the encoder hidden
states h^enc_i. The weights are defined such that 0 ≤ α_i ≤ 1 and Σ_i α_i = 1. Each α_i is a function of an encoder
hidden state and the previous decoder hidden state, f(h^dec_{t−1}, h^enc_i), where i ranges over the length of the
input sequence.
There are a few different choices for the possible function f. In this assignment we will explore two
different attention models: (1) additive attention, which uses a two-layer MLP, and (2) scaled dot-product
attention, which measures the similarity between the two hidden states.
To unify the interface across different attention modules, we treat attention as a function whose input
is a triple (queries, keys, values), denoted (Q, K, V). In additive attention, we will learn the function
f, parameterized as a two-layer fully-connected network with a ReLU activation. This network produces
unnormalized attention weights α̃^(t)_i that are used to compute the final context vector.
For the forward pass, you are given a batch of queries for the current time step, which has dimension
batch_size x hidden_size, and a batch of keys and values for each time step of the input sequence,
both of which have dimension batch_size x seq_len x hidden_size. The goal is to obtain the context vector.
We first compute the function f(Q_t, K_i) for each query in the batch and all corresponding keys K_i, where
i ranges over seq_len different values. You must do this in a vectorized fashion. Since f(Q_t, K_i) is a
scalar, the resulting tensor of attention weights should have dimension batch_size x seq_len x 1. The
AdditiveAttention module should return both the context vector, of dimension batch_size x 1 x hidden_size, and
the attention weights, of dimension batch_size x seq_len x 1.
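A minimal sketch of such a module is shown below; the two-layer MLP sizes and the attribute names are assumptions, and the provided template may structure things differently.

import torch
import torch.nn as nn
import torch.nn.functional as F

class AdditiveAttentionSketch(nn.Module):
    """Sketch of additive attention: f(Q_t, K_i) computed by a two-layer MLP with ReLU."""

    def __init__(self, hidden_size):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * hidden_size, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, 1),
        )

    def forward(self, queries, keys, values):
        # queries: (batch_size, hidden_size); keys, values: (batch_size, seq_len, hidden_size)
        q = queries.unsqueeze(1).expand_as(keys)            # (batch, seq_len, hidden)
        scores = self.mlp(torch.cat([q, keys], dim=2))      # unnormalized weights: (batch, seq_len, 1)
        attention_weights = F.softmax(scores, dim=1)        # normalize over seq_len
        context = torch.bmm(attention_weights.transpose(1, 2), values)  # (batch, 1, hidden)
        return context, attention_weights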
We will now apply the AdditiveAttention module to the RNN decoder. You are given a batch of decoder
hidden states as the query, h^dec_{t−1}, for time t − 1, which has dimension batch_size x hidden_size, and a
batch of encoder hidden states as the keys and values, h^enc = [h^enc_1, . . . , h^enc_i, . . . ] (annotations), for each
time step in the input sequence, which has dimension batch_size x seq_len x hidden_size.
Q_t ← h^dec_{t−1}    K ← h^enc    V ← h^enc (5)
We will use these as the inputs to self.attention to obtain the context vector. The output context vector is
concatenated with the input vector and passed into the decoder GRU cell at each time step, as shown in
the figure below. Fill in the forward method of the RNNAttentionDecoder class to implement the interface
shown in the figure; a sketch of these steps appears after the figure. You will need to implement the following steps:
• Get the embedding corresponding to the time step. (provided)
• Compute the context vector and the attention weights using self.attention. (implement)
• Concatenate the context vector with the current decoder input. (implement)
• Feed the concatenation to the decoder GRU cell to obtain the new hidden state. (provided)
[Figure: attention-based RNN decoder. At time step t, the attention weights α_1, . . . , α_T over the encoder annotations h^enc_1, . . . , h^enc_T are combined into a context vector, which is concatenated with the input x_t and fed to the decoder GRU cell to produce the new hidden state h^dec_t and the output o_t.]
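Putting these steps together, one decoder time step could look roughly like the following sketch; the argument names are our own, and the provided RNNAttentionDecoder bundles these pieces as submodules.

import torch

def attention_decoder_step(embedding, attention, gru_cell, out_layer,
                           x, h_prev, annotations):
    """One attention-decoder step (a sketch; argument names are assumptions).

    x:           (batch_size,) input token ids for this time step
    h_prev:      (batch_size, hidden_size) previous decoder hidden state
    annotations: (batch_size, seq_len, hidden_size) encoder hidden states
    """
    embed = embedding(x)                                                 # step 1 (provided)
    context, attn_weights = attention(h_prev, annotations, annotations)  # step 2 (implement)
    embed_and_context = torch.cat([embed, context.squeeze(1)], dim=1)    # step 3 (implement)
    h_new = gru_cell(embed_and_context, h_prev)                          # step 4 (provided)
    output = out_layer(h_new)                                            # logits over the vocabulary
    return output, h_new, attn_weights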
• Train the model with attention by running the following command:
python main.py
By default, the script runs for 100 epochs, which should be enough to get good results; this takes
approximately 24 minutes on my desktop.
At the end of each epoch, the script prints the training and validation losses, and the Pig-Latin translation of a fixed sentence, “the air conditioning is working”, so that you can see how the model
improves qualitatively over time. The fixed sentence is stored in the variable TEST_SENTENCE in
main.py. You can change this variable to see how translation improves for your own sentence.
The script also saves several items to the directory checkpoints/h20-bs64-rnn_attention:
– The training and validation loss curves.
– The best encoder and decoder model parameters, based on the validation loss.
Deliverables: Create a section in your report called Additive Attention and include the following in this section:
• A Python file with code: models.py
• The training/validation loss plots.
• At least one input-output pair for each failure case you identify.
3 Scaled Dot Product Attention (3pts): In lecture, we introduced the scaled dot-product attention used in
transformer models. The function f is a dot product between the linearly transformed queries and keys,
using weight matrices W_q and W_k:
α̃^(t)_i = f(Q_t, K_i) = (W_q Q_t)^T (W_k K_i) / √d (6)
α^(t)_i = softmax(α̃^(t))_i (7)
c_t = Σ_{i=1}^{T} α^(t)_i W_v v_i (8)
where d is the dimension of the query and W_v is the weight matrix corresponding to the values v_i.
1. Fill in the forward method of the ScaledDotAttention class. Use the torch.bmm command to
compute the dot product between the batched queries and the batched keys in the forward pass
of the ScaledDotAttention class to obtain the unnormalized attention weights. The following functions
are useful in implementing models like this, so it is worth getting familiar with how they
work (click to jump to the PyTorch documentation): squeeze, unsqueeze, cat, expand_as, view,
bmm. Your forward pass needs to work with both a 2D query tensor (batch_size x hidden_size) and a
3D query tensor (batch_size x k x hidden_size). A sketch covering this class and the next appears after this list.
2. Fill in the forward method of the CausalScaledDotAttention class. It will be mostly the same as the
ScaledDotAttention class; the additional computation is to mask out the attention to future
time steps. You will need to add self.neg_inf to some of the entries in the unnormalized attention
weights. You may find torch.tril handy for this part.
3. We will now use ScaledDotAttention as the building block for a simplified transformer [3] encoder. The encoder consists of three components (already provided):
• Positional encoding: Without any additional modifications, self-attention is permutation-equivariant. To encode the position of each word, we add to its embedding a constant vector
that depends on its position:
p-th word embedding = input embedding + positional encoding(p) (9)
We follow the same positional encoding methodology described in [3], using sine and
cosine functions:
PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) (10)
PE(pos, 2i + 1) = cos(pos / 10000^(2i/d_model)) (11)
• A ScaledDotAttention operation
• An MLP
Now, complete the forward method of TransformerEncoder. Most of the code is provided, except
for a few lines marked with . . . ; complete these lines.
4. The decoder, in addition to all the components the encoder has, also requires a
CausalScaledDotAttention component. The transformer solves the translation problem using layers of attention modules. In each layer, we first apply CausalScaledDotAttention
self-attention to the decoder inputs, followed by a ScaledDotAttention attention module over the
encoder annotations, similar to the attention decoder from the previous question. The output of
the attention layers is fed into a hidden layer with a ReLU activation. The final output of the
last transformer layer is passed to self.out to compute the word prediction. To improve the
optimization, we add residual connections between the attention layers and the ReLU layers. Now,
complete the forward method of TransformerDecoder. Again, most of the code is given to you; fill
in the lines marked with . . . .
5. Now, train the model with the transformer-based encoder/decoder. How do the translation
results compare to those of the previous decoders? Write a short, qualitative analysis.
6. Modify the transformer decoder __init__ to use non-causal attention for both the self-attention and the
encoder attention. What do you observe when training this modified transformer? How do the
results compare with the causal model? Why?
7. What are the advantages and disadvantages of using additive attention vs. scaled dot-product attention? List one advantage and one disadvantage for each method.
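As referenced in questions 1 and 2 above, here is a minimal sketch of scaled dot-product attention with an optional causal mask. The attribute names, the masking constant, and the single-class organization are assumptions; the assignment splits this functionality into ScaledDotAttention and CausalScaledDotAttention.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ScaledDotAttentionSketch(nn.Module):
    """Sketch of (causal) scaled dot-product attention, eqs. (6)-(8); not the official solution."""

    def __init__(self, hidden_size, causal=False):
        super().__init__()
        self.Q = nn.Linear(hidden_size, hidden_size)   # W_q
        self.K = nn.Linear(hidden_size, hidden_size)   # W_k
        self.V = nn.Linear(hidden_size, hidden_size)   # W_v
        self.causal = causal
        self.neg_inf = -1e7                            # large negative value used for masking

    def forward(self, queries, keys, values):
        # Accept both 2D (batch, hidden) and 3D (batch, k, hidden) queries.
        if queries.dim() == 2:
            queries = queries.unsqueeze(1)
        q = self.Q(queries)                            # (batch, k, hidden)
        k = self.K(keys)                               # (batch, seq_len, hidden)
        v = self.V(values)                             # (batch, seq_len, hidden)
        d = q.size(-1)
        scores = torch.bmm(q, k.transpose(1, 2)) / (d ** 0.5)   # (batch, k, seq_len)
        if self.causal:
            # Mask out attention to future time steps (assumes k == seq_len).
            mask = torch.tril(torch.ones_like(scores))
            scores = scores.masked_fill(mask == 0, self.neg_inf)
        attention_weights = F.softmax(scores, dim=-1)
        context = torch.bmm(attention_weights, v)      # (batch, k, hidden)
        return context, attention_weights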
Deliverables: Create a section in your report called Scaled Dot Product Attention. Include the following:
• The training/validation plots you generated.
• Your response to question 5.
• Your response to question 6.
• Your response to question 7.
4 Visualizing Attention (2pts): One of the benefits of using attention is that it allows us to gain insight
into the inner workings of the model. By visualizing the attention weights generated for the input tokens
in each decoder step, we can see where the model focuses while producing each output token. In this
part of the assignment, you will visualize the attention learned by your model, and try to find interesting
success and failure modes that illustrate its behavior.
The script visualize_attention.py loads a pre-trained model and uses it to translate a given set of
words: it prints the translations and saves heatmaps showing how attention is used. To call this script, you
need to pass in the path to a checkpoint folder, as follows:
python visualize_attention.py --load checkpoints/h20-bs64-rnn_attention
python visualize_attention.py --load checkpoints/h20-bs64-transformer
The visualize_attention.py script produces visualizations for the strings in the words list,
found at the top of the script. The visualizations are saved as PDF files in the same directory as the loaded model checkpoint, so they will be in checkpoints/h20-bs64-rnn_attention or
checkpoints/h20-bs64-transformer. Add your own strings to the words list in visualize_attention.py
and run the script as shown above. Since the model operates at the character level, the input doesn’t even
have to be a real word in the dictionary, so you can be as creative as you would like! After running the
script, you should examine the generated attention maps. Try to find failure cases, and hypothesize about
why they occur. Some interesting classes of words you may want to try are:
1. Words that begin with a single consonant (e.g., fake).
2. Words that begin with two or more consonants (e.g., brink).
3. Words that have unusual/rare letter combinations (e.g., aardvark).
4. Compound words consisting of two words separated by a dash (e.g., long-term). These are the
hardest class of words present in the training data, because they are long, and because the rules of
Pig-Latin dictate that each part of the word (e.g., long and term) must be translated separately
and stuck back together with a dash: onglay-ermtay.
5. Made-up words or toy examples to show a particular behavior.
Include attention maps for both success and failure cases in your writeup, along with your hypothesis
about why the model succeeds or fails.
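If you want to produce extra heatmaps beyond what the script saves, a minimal matplotlib sketch along the following lines can be used; this is our own helper, not the code inside visualize_attention.py.

import matplotlib.pyplot as plt
import torch

def plot_attention_map(attention_weights, source_tokens, target_tokens, path):
    """Plot a (target_len x source_len) attention matrix as a heatmap (a sketch)."""
    fig, ax = plt.subplots()
    ax.imshow(attention_weights.detach().cpu().numpy(), cmap="viridis")
    ax.set_xticks(range(len(source_tokens)))
    ax.set_xticklabels(source_tokens)
    ax.set_yticks(range(len(target_tokens)))
    ax.set_yticklabels(target_tokens)
    ax.set_xlabel("source (input characters)")
    ax.set_ylabel("target (output characters)")
    fig.savefig(path)

# Example call with random weights, just to show the interface:
plot_attention_map(torch.rand(6, 4), list("cat") + ["<EOS>"],
                   list("atcay") + ["<EOS>"], "attention_map.pdf")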
Submission: Here is everything you need to submit.
• A python file with code: models.py
• A PDF document titled programming-assignment-3-msunetid.pdf containing your answers to the
conceptual questions and the attention visualizations with explanations.
References
[1] Ilya Sutskever, Oriol Vinyals, and Quoc V Le. Sequence to sequence learning with neural networks. In
Advances in Neural Information Processing Systems, pages 3104–3112, 2014.
[2] Kyunghyun Cho, Bart van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger
Schwenk, and Yoshua Bengio. Learning phrase representations using RNN encoder-decoder for statistical
machine translation. arXiv preprint arXiv:1406.1078, 2014.
[3] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz
Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing
Systems, pages 5998–6008, 2017.