CSC413/2516 Programming Assignment 3: Natural Language Processing and Multimodal Learning

Introduction
In this assignment, you will explore common tasks and model architectures in Natural Language
Processing (NLP). Along the way, you will gain experience with important concepts like recurrent
neural networks and sequence-to-sequence architectures (Part 1), attention mechanisms (Part 2),
pretrained language models (Part 3) and multimodal vision and language models (Part 4).
Setting Up
We recommend that you use Colab (https://colab.research.google.com/) for the assignment.
To set up the Colab environment, just open the notebooks for each part of the assignment and
make a copy in your own Google Drive account.
Deliverables
Each section is followed by a checklist of deliverables to add in the assignment writeup. To also give
a better sense of our expectations for the answers to the conceptual questions, we’ve put maximum
sentence limits. You will not be graded for any additional sentences.
Part 1: Neural machine translation (NMT) [2pt]
Neural machine translation (NMT) is a subfield of NLP that aims to translate between languages
using neural networks. In this section, we will train an NMT model on the toy task of English →
Pig Latin. Please read the following background section carefully before attempting the questions.
Background
The task
Pig Latin is a simple transformation of English based on the following rules:
1. If the first letter of a word is a consonant, then the letter is moved to the end of the word,
and the letters “ay” are added to the end: team → eamtay.
2. If the first letter is a vowel, then the word is left unchanged and the letters “way” are added
to the end: impress → impressway.
3. In addition, some consonant pairs, such as “sh”, are treated as a block and are moved to the
end of the string together: shopping → oppingshay.
To translate a sentence from English to Pig-Latin, we apply these rules to each word independently:
i went shopping → iway entway oppingshay
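The three rules can be sketched as a small rule-based translator, which is also a convenient way to produce reference outputs when checking a trained model. This is only an illustrative sketch: the handout names "sh" as one example of a consonant block and the dataset example (the, ethay) implies "th" is another, but the remaining pairs in `CONSONANT_BLOCKS` are assumptions, not the course's list.

```python
VOWELS = set("aeiou")

# "sh" comes from rule 3 and "th" from the dataset example (the, ethay);
# the other pairs are assumed for illustration only.
CONSONANT_BLOCKS = ("sh", "th", "ch", "wh", "ph")


def pig_latin_word(word: str) -> str:
    """Apply the three rules to a single lowercase word."""
    if word[0] in VOWELS:                  # rule 2: vowel first, append "way"
        return word + "way"
    if word[:2] in CONSONANT_BLOCKS:       # rule 3: move the two-letter block
        return word[2:] + word[:2] + "ay"
    return word[1:] + word[0] + "ay"       # rule 1: move the first consonant


def pig_latin_sentence(sentence: str) -> str:
    """Translate each word independently and rejoin with spaces."""
    return " ".join(pig_latin_word(w) for w in sentence.split())


print(pig_latin_sentence("i went shopping"))
```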
Our goal is to build an NMT model that can learn the rules of Pig-Latin implicitly from (English,
Pig-Latin) word pairs. Since the translation to Pig-Latin involves moving characters around in a
string, we will use character-level recurrent neural networks (RNNs). Because English and
Pig-Latin are similar in structure, the translation task is almost a copy task; the model must remember
each character in the input and recall the characters in a specific order to produce the output. This
makes it an ideal task for understanding the capacity of NMT models.
The data
The data for this task consists of pairs of words $\{(s^{(i)}, t^{(i)})\}_{i=1}^{N}$, where the source
$s^{(i)}$ is an English word and the target $t^{(i)}$ is its translation in Pig-Latin.³ The dataset
contains 3198 unique (English, Pig-Latin) pairs in total; the first few examples are:
{ (the, ethay), (family, amilyfay), (of, ofway), … }
³ In order to simplify the processing of mini-batches of words, the word pairs are grouped based on the lengths of
the source and target. Thus, in each mini-batch, the source words are all the same length, and the target words are
all the same length. This simplifies the code, as we don’t have to worry about batches of variable-length sequences.
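The grouping described in this footnote could look roughly like the sketch below. The function name `group_by_length` and the example pairs are illustrative; the notebook's own data-loading code may be organized differently.

```python
from collections import defaultdict


def group_by_length(pairs):
    """Bucket (source, target) pairs by (len(source), len(target))."""
    buckets = defaultdict(list)
    for source, target in pairs:
        buckets[(len(source), len(target))].append((source, target))
    return dict(buckets)


pairs = [("the", "ethay"), ("family", "amilyfay"), ("of", "ofway"), ("cat", "atcay")]
buckets = group_by_length(pairs)

# Every bucket can be turned into mini-batches without any padding,
# because all sources (and all targets) inside it share one length.
for lengths, bucket in buckets.items():
    print(lengths, bucket)
```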
In this assignment, you will investigate the effect of dataset size on generalization ability. We
provide a small and a large dataset. The small dataset is composed of a subset of the unique words
from the book “Sense and Sensibility” by Jane Austen. The vocabulary consists of 29 tokens:
the 26 standard alphabet letters (all lowercase), the dash symbol -, and two special tokens
that denote the start and end of a sequence, respectively.⁴ The second, larger dataset
is obtained from Peter Norvig’s natural language corpus.⁵ It contains the top 20,000 most used
English words, which are combined with the previous dataset to obtain 22,402 unique words. This
dataset contains the same vocabulary as the previous dataset.
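As a rough illustration of the 29-token vocabulary, the sketch below builds a character-to-index table. The special-token names `SOS` and `EOS`, the index order, and the choice to append the end token in `encode` are assumptions made for this example, not details taken from the notebook.

```python
import string

# Hypothetical names for the two special tokens.
START_TOKEN, END_TOKEN = "SOS", "EOS"

# 26 lowercase letters + dash + 2 special tokens = 29 symbols.
symbols = list(string.ascii_lowercase) + ["-", START_TOKEN, END_TOKEN]
char_to_index = {symbol: index for index, symbol in enumerate(symbols)}


def encode(word):
    """Map a word to token indices and append the end-of-sequence index."""
    return [char_to_index[ch] for ch in word] + [char_to_index[END_TOKEN]]


print(len(char_to_index), encode("the"))
```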
The model
Translation is a sequence-to-sequence (seq2seq) problem. The goal is to train a model to transform
one sequence into another. In our case, both the input and output are sequences of characters. A
typical architecture used for seq2seq problems is the encoder-decoder model [8], composed of two
RNNs. The encoder RNN compresses the input sequence into a fixed-length vector, $h^{enc}_T$. The
decoder RNN conditions on this vector to produce the translation, character by character. Input
characters are passed through an embedding layer before being fed into the encoder RNN. If H is
the dimension of the encoder RNN hidden state, we learn a 29×H embedding matrix, where each of
the 29 characters in the vocabulary is assigned an H-dimensional embedding. At each time step, the
decoder RNN outputs a vector of unnormalized log probabilities given by a linear transformation of
the decoder hidden state. When these probabilities are normalized (i.e., by passing them through
a softmax), they define a distribution over the vocabulary, indicating the most probable characters
for that time step.
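To make the normalization step concrete, here is a minimal softmax over a toy vector of unnormalized log probabilities. The three-entry `logits` list is made up for illustration; in the assignment the vector would have 29 entries, one per vocabulary symbol.

```python
import math


def softmax(logits):
    """Turn unnormalized log probabilities into a probability distribution."""
    shift = max(logits)                       # subtract the max for stability
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 0.5, -1.0]                     # toy decoder output
probs = softmax(logits)
most_probable = max(range(len(probs)), key=probs.__getitem__)
print(probs, most_probable)
```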
The model is trained via a cross-entropy loss between the decoder distribution and ground-truth
at each time step. A common practice used to train NMT models is to feed in the ground-truth
token from the previous time step to condition the decoder output in the current step. This
training procedure is known as “teacher-forcing” and is shown in Figure 1. At test time, we don’t
have access to the ground-truth output sequence, so the decoder must condition its output on the
token it generated in the previous time step, as shown in Figure 2.
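The difference between the two decoding regimes can be sketched with a toy stand-in for the decoder. Everything here is hypothetical: `TOY_TABLE` replaces a trained RNN with a lookup from the previous token to the next one, and it contains one deliberate mistake ("b" leads to "x" instead of "c") to show how errors behave in each regime.

```python
def decode_with_teacher_forcing(step, start_token, targets):
    """Training-style: step t is conditioned on the ground-truth token t-1."""
    predictions, previous = [], start_token
    for target in targets:
        predictions.append(step(previous))
        previous = target                  # feed the ground truth
    return predictions


def decode_greedily(step, start_token, end_token, max_len):
    """Test-style: step t is conditioned on the model's own token t-1."""
    predictions, previous = [], start_token
    for _ in range(max_len):
        previous = step(previous)          # feed back the model's output
        predictions.append(previous)
        if previous == end_token:
            break
    return predictions


# Hypothetical "decoder": previous token -> next token, with one error.
TOY_TABLE = {"<s>": "a", "a": "b", "b": "x", "c": "d", "x": "x"}
toy_step = TOY_TABLE.get

forced = decode_with_teacher_forcing(toy_step, "<s>", ["a", "b", "c", "d"])
free = decode_greedily(toy_step, "<s>", "</s>", max_len=5)
print(forced)
print(free)
```

With teacher forcing the mistake stays confined to one step, because the next step is fed the correct token; in free-running generation the wrong token is fed back and the error compounds.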
A common choice for the encoder and decoder RNNs is the Long Short-Term Memory (LSTM)
architecture [3]. The forward pass of an LSTM unit is defined by the following equations:
⁴ Note that for the English-to-Pig-Latin task, the input and output sequences share the same vocabulary; this is
not always the case for other translation tasks (e.g., between languages that use different alphabets).
⁵ https://norvig.com/ngrams/
Figure 1: Training the NMT encoder-decoder architecture. [Encoder input “c a t”; decoder output “a t c a y”.]
Figure 2: Generating text with the NMT encoder-decoder architecture. [Encoder input “c a t”; decoder output “a t c a y”.]
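$f_t = \sigma(W_{if} x_t + b_{if} + W_{hf} h_{t-1} + b_{hf})$ (1)
$i_t = \sigma(W_{ii} x_t + b_{ii} + W_{hi} h_{t-1} + b_{hi})$ (2)
$\tilde{C}_t = \tanh(W_{ic} x_t + b_{ic} + W_{hc} h_{t-1} + b_{hc})$ (3)
$C_t = f_t * C_{t-1} + i_t * \tilde{C}_t$ (4)
$o_t = \sigma(W_{io} x_t + b_{io} + W_{ho} h_{t-1} + b_{ho})$ (5)
$h_t = o_t * \tanh(C_t)$ (6)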
where $*$ is element-wise multiplication, $\sigma$ is the sigmoid activation function, $x_t$ is the input
for timestep $t$, $h_{t-1}$ is the hidden state from the previous timestep, and $W$ and $b$ represent
weight matrices and biases, respectively. The Gated Recurrent Unit (GRU) [1] can be seen as a
simplification of the LSTM unit, as it combines the forget and input gates ($f_t$ and $i_t$) into a single
update gate ($z_t$) and merges the cell state ($C_t$) and hidden state ($h_t$):
$z_t = \sigma(W_{iz} x_t + W_{hz} h_{t-1} + b_z)$ (7)
$r_t = \sigma(W_{ir} x_t + W_{hr} h_{t-1} + b_r)$ (8)
$\tilde{h}_t = \tanh(W_{ih} x_t + r_t * (W_{hh} h_{t-1} + b_g))$ (9)
$h_t = (1 - z_t) * h_{t-1} + z_t * \tilde{h}_t$ (10)
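To see how equations (7) to (10) fit together, the sketch below evaluates one GRU step for a scalar input and a scalar hidden state, so every weight is a single number. This is a deliberately simplified illustration and not the MyGRUCell class from the notebook, which works on batched tensors; the parameter dictionary and its key names are assumptions that mirror the symbols in the equations.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def gru_step(x, h_prev, p):
    """One GRU step for scalar x and h_prev, following equations (7)-(10)."""
    z = sigmoid(p["W_iz"] * x + p["W_hz"] * h_prev + p["b_z"])   # update gate
    r = sigmoid(p["W_ir"] * x + p["W_hr"] * h_prev + p["b_r"])   # reset gate
    # Candidate state: the reset gate scales the recurrent contribution.
    h_tilde = math.tanh(p["W_ih"] * x + r * (p["W_hh"] * h_prev + p["b_g"]))
    return (1.0 - z) * h_prev + z * h_tilde                      # new hidden


# With all parameters at zero, z = 0.5 and the candidate state is 0,
# so the new hidden state is half of the previous one.
zero_params = dict.fromkeys(
    ["W_iz", "W_hz", "b_z", "W_ir", "W_hr", "b_r", "W_ih", "W_hh", "b_g"], 0.0
)
h = gru_step(x=1.0, h_prev=0.8, p=zero_params)
print(h)
```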
See “Understanding LSTM Networks” for an excellent overview of LSTMs and GRUs. In this assignment,
we will provide you with an implementation of an LSTM cell and ask you to implement
a GRU cell. Please open
https://colab.research.google.com/github/uoft-csc413/2022/blob/master/assets/assignments/nmt.ipynb
on Colab and answer the following questions.
1. [1pt] Code a GRU Cell. Although PyTorch has a built-in GRU implementation (nn.GRUCell),
we’ll implement our own GRU cell from scratch, to better understand how it works. Complete
the __init__ and forward methods of the MyGRUCell class, by implementing the above
equations.
Train the RNN encoder/decoder model on both datasets. We’ve provided implementations
for recurrent encoder/decoder models using your GRU cell. (Make sure you have run all the
relevant previous cells to load the training and utility functions.)
At the end of each epoch, the script prints training and validation losses and the Pig-Latin
translation of a fixed sentence, “the air conditioning is working”, so that you can see how the
model improves qualitatively over time. The script also saves several items:
• The best encoder and decoder model parameters, based on the validation loss.
• A plot of the training and validation losses.
After the models have been trained on both datasets, pig_latin_small and pig_latin_large,
run the save_loss_comparison_gru method, which compares the loss curves of the two models. Then, answer the following questions in 3 sentences or less:
• Does either model perform significantly better?
• Why might this be the case?
2. [0.5pt] Identify failure modes. After training, pick the best model and use it to translate test
sentences using the translate_sentence function. Try new words by changing the variable
TEST_SENTENCE. Identify a distinct failure mode and describe it in 3 sentences or less.
3. [0.5pt] Comparing complexity. Consider an LSTM and a GRU encoder, each with an H-dimensional
hidden state, and an input sequence with vocabulary size V, embedding feature size D,
and length K.
• What is the total number of parameters of the LSTM encoder?
• What is the total number of parameters of the GRU encoder?
Provide your answer in terms of H, V, D and K (you may not need all terms). For simplicity,
assume the input to the encoders has already been embedded and ignore the embedding layer
in your parameter count. You should also ignore the bias units. You may find it useful to
read through the PyTorch documentation for LSTMCell and GRUCell before answering.
Deliverables
Create a section in your report called Part 1: Neural machine translation (NMT). Add the
following in this section:
1. Answer to question 1. Include the completed MyGRUCell, either as code or as a screenshot of
the code. Make sure both the __init__ and forward methods are visible. Also include your
answer to the two questions in three sentences or less. [1pt]
2. Answer to question 2. Make sure to include input-output pair examples for the failure case you
identify. Your answer should not exceed three sentences in total (excluding the input-output
pair examples.) [0.5pt]
3. Answer to question 3. Your answer should not exceed one sentence in length. [0.5pt]
Part 2.1: Additive Attention [1pt]
Attention allows a model to look back over the input sequence, and focus on relevant input tokens
when producing the corresponding output tokens. For our simple task, attention can help the
model remember tokens from the input, e.g., focusing on the input letter c to produce the output
letter c.
The hidden states produced by the encoder while reading the input sequence, $h^{enc}_1, \ldots, h^{enc}_T$,
can be viewed as annotations of the input; each encoder hidden state $h^{enc}_i$ captures information about
the $i$-th input token, along with some contextual information. At each time step, an attention-based
decoder computes a weighting over the annotations, where the weight given to each one indicates
its relevance in determining the current output token.
In particular, at time step $t$, the decoder computes an attention weight $\alpha^{(t)}_i$ for each of the
encoder hidden states $h^{enc}_i$. The attention weights are defined such that $0 \le \alpha^{(t)}_i \le 1$ and
$\sum_i \alpha^{(t)}_i = 1$. $\alpha^{(t)}_i$ is a function of an encoder hidden state and the previous decoder
hidden state, $f(h^{dec}_{t-1}, h^{enc}_i)$, where $i$ ranges over the length of the input sequence.
There are a few engineering choices for the possible function f. In this assignment, we will
investigate two different attention models: 1) the additive attention using a two-layer MLP and 2)
the scaled dot product attention, which measures the similarity between the two hidden states.
To unify the interface across different attention modules, we consider attention as a function
whose inputs are a triple (queries, keys, values), denoted as (Q, K, V).
In the additive attention, we will learn the function f, parameterized as a two-layer
fully-connected network with a ReLU activation. This network produces unnormalized weights
$\tilde{\alpha}^{(t)}_i$ that are used to compute the final context vector.
Figure 3: Dimensions of the inputs, Decoder Hidden States (query), Encoder Hidden States
(keys/values) and the attention weights ($\alpha^{(t)}$). [Decoder hidden states: batch_size x
hidden_size; encoder hidden states: batch_size x seq_len x hidden_size; attention weights:
batch_size x seq_len x 1.]
In the forward pass, we are given a batch of queries, Q of the current time step, which has
dimensions batch_size x hidden_size, and a batch of keys K and values V for each time step of
the input sequence, both of which have dimensions batch_size x seq_len x hidden_size. The goal is to
obtain the context vector. We first compute the function $f(Q_t, K)$ for each query in the batch and
all corresponding keys $K_i$, where $i$ ranges over seq_len different values. Since $f(Q_t, K_i)$ is a scalar,
the resulting tensor of attention weights has dimension batch_size x seq_len x 1. Some of the
important tensor dimensions in the AdditiveAttention module are visualized in Figure 3. The
AdditiveAttention module returns both the context vector batch_size x 1 x hidden_size and
the attention weights batch_size x seq_len x 1.
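The computation can be sketched for a single example (no batch dimension) in plain Python. This is an assumption-laden illustration rather than the provided AdditiveAttention code: it assumes the two-layer network takes the concatenation of the query and one key as input, omits biases, and uses hand-picked weights `W1` and `w2` so the result is easy to follow.

```python
import math


def softmax(scores):
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vector):
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def additive_attention(query, keys, values, W1, w2):
    """query: [H]; keys, values: seq_len lists of [H]; W1: H x 2H; w2: [H]."""
    scores = []
    for key in keys:
        hidden = [max(0.0, a) for a in matvec(W1, query + key)]  # layer 1 + ReLU
        scores.append(sum(w * h for w, h in zip(w2, hidden)))    # layer 2, scalar
    weights = softmax(scores)                                    # sums to 1
    context = [
        sum(w * value[j] for w, value in zip(weights, values))
        for j in range(len(values[0]))
    ]
    return context, weights


query = [1.0, 0.0]
keys = values = [[1.0, 0.0], [0.0, 1.0]]
W1 = [[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0]]   # toy weights (assumed)
w2 = [1.0, 0.0]
context, weights = additive_attention(query, keys, values, W1, w2)
print(context, weights)
```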
1. [0pt] Read how the provided forward method of the AdditiveAttention class computes
$\tilde{\alpha}^{(t)}_i$, $\alpha^{(t)}_i$, and $c_t$.
2. [0pt] The notebook provides all required code to run the additive attention model. Run the
notebook to train a language model that has additive attention in its decoder. Find one
training example where the decoder with attention performs better than the decoder without
attention. Show the input/outputs of the model with attention, and the model without
attention that you’ve trained in the previous section.
3. [1pt] How does the training speed compare? Why? Explain your answer in no more than
three sentences.
4. [0pt] Given an input sequence of length K with embedding feature size D, assume the
RNNAttentionDecoder uses this input to generate an output sequence of length K, which
has vocabulary size V. Write down the number of LSTM units in RNNAttentionDecoder and
the number of connections in the above computation, as a function of hidden state size H, V,
D, and K. Assume the attention network is parameterized as in AdditiveAttention. For
simplicity, you may ignore the bias units. You may also ignore the embedding process in your
computations. However, do include the connections associated with the output layer.
Deliverables
Create a section called Additive Attention. Add the following in this section:
• Answer to question 3. [1pt]
Part 2.2: Scaled Dot Product Attention [4pt]
1. [0.5pt] In lecture, we learnt about Scaled Dot-product Attention used in the transformer
models. The function f is a dot product between the linearly transformed query and keys
using weight matrices $W_q$ and $W_k$:
$\tilde{\alpha}^{(t)}_i = f(Q_t, K_i) = \frac{(W_q Q_t)^T (W_k K_i)}{\sqrt{d}}$,
$\alpha^{(t)}_i = \mathrm{softmax}(\tilde{\alpha}^{(t)})_i$,
$c_t = \sum_{i=1}^{T} \alpha^{(t)}_i W_v V_i$,
where $d$ is the dimension of the query and $W_v$ denotes the weight matrix that projects the values
to produce the final context vectors.
Implement the scaled dot-product attention mechanism. Fill in the forward methods of the ScaledDotAttention class. Use the PyTorch torch.bmm (or @) to compute the
dot product between the batched queries and the batched keys in the forward pass of the
ScaledDotAttention class for the unnormalized attention weights.
The following PyTorch functions are useful in implementing models like this. You might find it
useful to get familiar with how they work:
• squeeze
• unsqueeze
• expand_as
• cat
• view
• bmm (or @)
Your forward pass needs to work with both a 2D query tensor (batch_size x (1) x hidden_size)
and a 3D query tensor (batch_size x k x hidden_size).
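For intuition only, the sketch below evaluates the three formulas for a single query in plain Python. It is not the ScaledDotAttention class: there is no batch dimension, and the projections by W_q, W_k and W_v are assumed to have been applied already, so `q`, `keys` and `values` stand for the projected vectors.

```python
import math


def softmax(scores):
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_attention(q, keys, values):
    """q: [d]; keys: T lists of [d]; values: T lists of any common length."""
    d = len(q)
    # Unnormalized weights: dot product scaled by sqrt(d).
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
    weights = softmax(scores)
    # Context vector: attention-weighted sum of the values.
    context = [
        sum(w * v[j] for w, v in zip(weights, values))
        for j in range(len(values[0]))
    ]
    return context, weights


q = [2.0, 0.0, 0.0, 0.0]
keys = [[2.0, 0.0, 0.0, 0.0], [0.0, 2.0, 0.0, 0.0]]
values = [[1.0, 0.0], [0.0, 1.0]]
context, weights = scaled_dot_attention(q, keys, values)
print(context, weights)
```

In the batched PyTorch version the per-key loop becomes a single torch.bmm between the queries and the transposed keys.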
2. [0.5pt] Implement the causal scaled dot-product attention mechanism. Fill in the
forward method in the CausalScaledDotAttention class. It will be mostly the same as
the ScaledDotAttention class. The additional computation is to mask out the attention
to the future time steps. You will need to add self.neg_inf to some of the entries in the
unnormalized attention weights. You may find torch.tril handy for this part.
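The masking idea can be illustrated without PyTorch. In this sketch `scores[i][j]` is the unnormalized weight of query position i on key position j, and `NEG_INF` is a hypothetical stand-in for self.neg_inf; the tensor layout in the notebook may be transposed relative to this, so treat the indexing as an assumption.

```python
import math

NEG_INF = -1e7   # assumed large negative constant, standing in for self.neg_inf


def causal_mask_scores(scores):
    """Add NEG_INF wherever a query at position i would look at a key j > i."""
    n = len(scores)
    return [
        [scores[i][j] + (0.0 if j <= i else NEG_INF) for j in range(n)]
        for i in range(n)
    ]


def softmax(row):
    shift = max(row)
    exps = [math.exp(s - shift) for s in row]
    total = sum(exps)
    return [e / total for e in exps]


scores = [[0.0] * 3 for _ in range(3)]          # uniform toy scores
weights = [softmax(row) for row in causal_mask_scores(scores)]
for row in weights:
    print(row)
```

After the softmax, the masked (future) positions receive zero weight, and each row still sums to one over the positions that remain visible.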
Figure 4: The transformer architecture. [9]
3. [0.5pt] We will train a model using the ScaledDotAttention mechanism as an encoder and
decoder. Run the AttentionEncoder and AttentionDecoder class sections as well as the training
block. Comment on how the performance of the network compares to the RNNAttention
model. Why do you think the performance is better or worse?
4. [0.5pt] We will now use ScaledDotAttention as the building blocks for a simplified transformer [9] encoder.
The encoder looks like the left half of Figure 4. The encoder consists of three components:
• Positional encoding: To encode the position of each word, we add to its embedding a
constant vector that depends on its position:
pth word embedding = input embedding + positional encoding(p)
We follow the same positional encoding methodology described in [9]. That is, we use
sine and cosine functions:
$PE(pos, 2i) = \sin\left(pos / 10000^{2i/d_{model}}\right)$ (11)
$PE(pos, 2i + 1) = \cos\left(pos / 10000^{2i/d_{model}}\right)$ (12)
Since we always use the same positional encodings throughout the training, we pregenerate all those we’ll need while constructing this class (before training) and keep
reusing them throughout the training.
• A ScaledDotAttention operation.
• A following MLP.
For this question, describe in one or two sentences why we need to represent the position of
each word through this positional encoding. Additionally, describe in one or two sentences the
advantages of using this positional encoding method, as opposed to other positional encoding
methods such as a one-hot encoding.
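Equations (11) and (12) can be pre-generated as a table once, before training, roughly as follows. The function name `positional_encoding` and its arguments are illustrative and are not taken from the notebook.

```python
import math


def positional_encoding(max_len, d_model):
    """Return a max_len x d_model table of sinusoidal position encodings."""
    table = [[0.0] * d_model for _ in range(max_len)]
    for pos in range(max_len):
        # `even` plays the role of 2i in equations (11) and (12).
        for even in range(0, d_model, 2):
            angle = pos / (10000 ** (even / d_model))
            table[pos][even] = math.sin(angle)          # PE(pos, 2i)
            if even + 1 < d_model:
                table[pos][even + 1] = math.cos(angle)  # PE(pos, 2i + 1)
    return table


pe = positional_encoding(max_len=3, d_model=4)
for row in pe:
    print(row)
```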
5. [1pt] The TransformerEncoder and TransformerDecoder modules have been completed for
you. Train the language model with transformer based encoder/decoder using the first configuration (hidden size 32, small dataset). How do the translation results compare to the
RNNAttention and single-block Attention decoders? Write a short, qualitative analysis.
Your answer should not exceed three sentences for each decoder (six total).
6. [1pt] In the code notebook, we have provided an experimental setup to evaluate the performance of the Transformer as a function of hidden size and data set size. Run the Transformer
model using hidden size 32 versus 64, and using the small versus large dataset (in total, 4
runs). We suggest using the provided hyper-parameters for this experiment.
Run these experiments, and report the effects of increasing model capacity via the hidden
size, and the effects of increasing dataset size. In particular, report your observations on how
loss as a function of gradient descent iterations is affected, and how changing model/dataset
size affects the generalization of the model. Are these results what you would expect?
In your report, include the two loss curves output by save_loss_comparison_by_hidden
and save_loss_comparison_by_dataset, the lowest attained validation loss for each run,
and your response to the above questions.
7. [0pt] The decoder includes the additional CausalScaledDotAttention component. Take a
look at Figure 4. The transformer solves the translation problem using layers of attention
modules. In each layer, we first apply the CausalScaledDotAttention self-attention to the
decoder inputs, followed by the ScaledDotAttention attention module to the encoder
annotations, similar to the attention decoder from the previous question. The output of the attention
layers is fed into a hidden layer using ReLU activation. The final output of the last
transformer layer is passed to self.out to compute the word prediction. To improve the
optimization, we add residual connections between the attention layers and ReLU layers.
Modify the transformer decoder __init__ to use non-causal attention for both self attention
and encoder attention. What do you observe when training this modified transformer? How
do the results compare with the causal model? Why?
8. [0pt] What are the advantages and disadvantages of using additive attention vs scaled
dot-product attention? List one advantage and one disadvantage for each method.
Deliverables
Create a section in your report called Scaled Dot Product Attention. Add the following:
• Screenshots of your ScaledDotAttention and CausalScaledDotAttention implementations.
Highlight the lines you’ve added. [1pt]
• Your answer to question 3. [0.5pt]
• Your answer to question 4. [0.5pt]
• Your response to question 5. Your analysis should not exceed six sentences. [0.5pt]
• The two loss curve plots output by the experimental setup in question 6, and the lowest
validation loss for each run. [1pt]
• Your response to the written component of question 6. [0.5pt]
Part 3: Fine-tuning Pretrained Language Models (LMs) [2pt]
The previous sections had you train models from scratch. However, similar to computer vision
(CV), it is now very common in natural language processing (NLP) to fine-tune pretrained models.
Indeed, this has been described as “NLP’s ImageNet moment.”⁶ In this section, we will learn how
to fine-tune pretrained language models (LMs) on a new task. We will use a simple classification
task, where the goal is to determine whether a verbal numerical expression is negative (label 0),
zero (label 1), or positive (label 2). For example, “eight minus ten” is negative, so our classifier
should output label index 0. As our pretrained LM, we will use the popular BERT model, which
uses a transformer encoder architecture similar to the TransformerEncoder from Part 2. More
specifically, we will explore two versions of BERT: MathBERT [7], which has been pretrained on
a large mathematical corpus ranging from pre-kindergarten to college graduate level mathematical
content, and BERTweet [4], which has been pretrained on hundreds of millions of tweets.
Most of the code is given to you in the notebook
https://colab.research.google.com/github/uoft-csc413/2022/blob/master/assets/assignments/bert.ipynb.
The starter code uses the HuggingFace Transformers library⁷, which has more than 50k stars on
GitHub due to its ease of use, and will be very useful for your NLP research or projects in the
future. Your task is to adapt BERT so that it can be fine-tuned on our downstream task. Before
starting this section, please carefully review the background for BERT and the verbal arithmetic
dataset (below).
Background
BERT
Bidirectional Encoder Representations from Transformers (BERT) [2] is an LM based on the
Transformer [9] encoder architecture that has been pretrained on a large dataset of unlabeled sentences
from Wikipedia and BookCorpus [10]. Given a sequence of tokens, BERT outputs a “contextualized
representation” vector for each token. Because BERT is pretrained on a large amount of text, these
contextualized representations encode useful properties of the syntax and semantics of language.
BERT has 2 pretraining objectives: (1) Masked Language Modeling (MLM), and (2) Next
Sentence Prediction (NSP). The input to the model is a sequence of tokens of the form:
[CLS] Sentence A [SEP] Sentence B
where [CLS] (“class”) and [SEP] (“separator”) are special tokens. In MLM, some percentage of the
input tokens are randomly “masked” by replacing them with the [MASK] token, and the objective
is to use the final layer representation for that masked token to predict the correct word that was
masked out.⁸ In NSP, the task is to use the contextualized representation of the [CLS] token to
predict whether sentence A and sentence B are consecutive sentences in the unlabeled dataset. See
Figure 5 for the conceptual picture of BERT pretraining and fine-tuning.
Figure 5: Overall pretraining and fine-tuning for BERT. Reproduced from BERT paper [2]
[Panels: Pre-training (NSP and Mask LM on an unlabeled sentence A and B pair) and Fine-Tuning
(SQuAD, MNLI, NER).]
⁶ https://ruder.io/nlp-imagenet/
⁷ https://huggingface.co/docs/transformers
⁸ The actual training setup is slightly more complicated but conceptually similar. Notice, this is similar to one of
the models in Programming Assignment 1!
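A simplified version of the MLM masking step might look like the sketch below. The masking probability of 0.15, the function name `mask_tokens`, and the example sentence pair are assumptions for illustration; as the handout notes, the real BERT training setup is somewhat more involved.

```python
import random

SPECIAL_TOKENS = ("[CLS]", "[SEP]")


def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Replace random non-special tokens with [MASK].

    Returns the masked sequence and, per position, the original token the
    model should predict (None where nothing was masked).
    """
    rng = random.Random(seed)
    masked, labels = [], []
    for token in tokens:
        if token not in SPECIAL_TOKENS and rng.random() < mask_prob:
            masked.append("[MASK]")
            labels.append(token)
        else:
            masked.append(token)
            labels.append(None)
    return masked, labels


tokens = ["[CLS]", "four", "minus", "ten", "[SEP]", "it", "is", "negative"]
masked, labels = mask_tokens(tokens, mask_prob=0.5, seed=1)
print(masked)
print(labels)
```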
Once pretrained, we can fine-tune BERT on a downstream task of interest, such as sentiment
analysis or question-answering, benefiting from its learned contextual representations. Typically,
this is done by adding a simple classifier, which maps BERT’s outputs to the class labels for our
downstream task. Often, this classifier is a single linear layer + softmax. We can choose to train
only the parameters of the classifier, or we can fine-tune both the classifier and BERT model jointly.
Because BERT has been pretrained on a large amount of data, we can get good performance by
fine-tuning for a few epochs with only a small amount of labelled data.
In this assignment, you will fine-tune BERT on a single sentence classification task.
Figure 6 illustrates the basic setup for fine-tuning BERT on this task. We prepend the tokenized
sentence with the [CLS] token, then feed the sequence into BERT. We then take the contextualized
[CLS] token representation at the last layer of BERT as input to a simple classifier, which will
learn to predict the probabilities for each of the possible output classes of our task. We will use
the pretrained weights of MathBERT, which uses the same architecture as BERT, but has been
pretrained on a large mathematical corpus, which more closely matches our task data (see below).
Figure 6: Fine-tuning BERT for single sentence classification by adding a layer on top of the
contextualized [CLS] token representation. Reproduced from BERT paper [2]
Verbal Arithmetic Dataset
The verbal arithmetic dataset contains pairs of input sentences and labels. The input sentences
express a simple addition or subtraction. Each input is labelled as 0, 1, or 2 if it evaluates to
negative, zero, or positive, respectively. There are 640 examples in the train set and 160 in the test
set. All inputs have only three tokens similar to the examples shown below:
Input expression           Label   Label meaning
four minus ten             0       “negative”
eighteen minus eighteen    1       “zero”
four plus seven            2       “positive”
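The labelling rule behind this table can be written down directly, which is handy for sanity-checking model predictions. The sketch assumes operands between zero and twenty and only the operators "plus" and "minus"; the actual range of numbers in the dataset is not stated in the handout.

```python
# Assumed operand range: zero through twenty.
NUMBER_WORDS = [
    "zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
    "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
    "sixteen", "seventeen", "eighteen", "nineteen", "twenty",
]
WORD_TO_VALUE = {word: value for value, word in enumerate(NUMBER_WORDS)}


def label_expression(expression):
    """Return 0, 1 or 2 for a negative, zero or positive result."""
    left, operator, right = expression.split()
    a, b = WORD_TO_VALUE[left], WORD_TO_VALUE[right]
    if operator == "plus":
        result = a + b
    elif operator == "minus":
        result = a - b
    else:
        raise ValueError(f"unsupported operator: {operator}")
    if result < 0:
        return 0
    return 1 if result == 0 else 2


for text in ["four minus ten", "eighteen minus eighteen", "four plus seven"]:
    print(text, label_expression(text))
```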
Questions:
1. [1pt] Add a classifier to BERT. Open the notebook
https://colab.research.google.com/github/uoft-csc413/2022/blob/master/assets/assignments/bert.ipynb
and complete Question 1 by filling in the missing lines of code in BertForSentenceClassification.
2. [0pt] Fine-tune BERT. Open the notebook and run the cells under Question 2 to fine-tune
the BERT model on the verbal arithmetic dataset. If question 1 was completed correctly, the
model should train, and a plot of train loss and validation accuracy will be displayed.