Description
1 Recurrent Neural Network
1. [Vanilla RNN for parity function: 4 points]
Let us define a sequence parity function as a function that takes in a sequence of binary inputs
and returns a sequence indicating the number of 1’s in the input so far; specifically, if at time
t the 1’s in the input so far is even it returns 1, and 0 if it is odd. For example, given input
sequence [0, 1, 0, 1, 1, 0], the parity sequence is [1, 0, 0, 1, 0, 0]. Let xt denote the input at time
t and yt be the boolean output of this parity function at time t.
Design a vanilla recurrent neural network to implement the parity function. Your implementation should include the equations of your hidden units and the details about activations
with different input at xt (0 or 1).
Hint: Recall that we have implemented AND and OR functions in problem set 2, which may
be useful here.
2. [LSTM for parity function: 4 points]
Let us recall different gates in an LSTM Network. First gate is the “forget gate layer”:
ft = σ(Wf .[ht−1, xt
] + bf ) (1)
where ft
is the output of forget gate, Wf is the weight matrix, ht−1 is the hidden state of step
t-1, xt
is the current input and bt
is the bias value.
Next we have “input gate layer”:
it = σ(Wi
.[ht−1, xt
] + bi) (2)
C˜
t = tanh(WC.[ht−1, xt
] + bC) (3)
where it decides which values we will update and C˜
t are the new candidate values that could
be added to the cell state. Next we have new cell state candidate values:
Ct = ft ∗ Ct−1 + it ∗ C˜
t (4)
Finally, we have the output and hidden state
ot = σ(Wo.[ht−1, xt
] + bo) (5)
ht = ot ∗ tanh(Ct) (6)
Design an LSTM Network for the bit parity problem mentioned in Question 1. Specifically,
provide values for Wf , bf , Wi
, bi
, WC, bC, Wo and bo such that the cell state Ct store the
parity of bit string. Please mention any assumptions you make. For this problem, you can
assume below for Sigmoid and tanh function:
σ(x) = (
1, if x > 0
0, otherwise
(7)
tanh(x) =
1, if x > 0
0, x = 0
−1, if x < 0
(8)
2 CS4803/7643
Hint: Recall that XOR of x and y can be represented as (x ∧ y¯) ∨ (x¯ ∧ y). Think about how
you can leverage this information for equation (4).
3. [When to stop in beam search?: 4 points]
Beam Search is a technique widely used to improve the output text quality of neural text
generation. But it is difficult to decide when to stop beam search to obtain optimality because
hypotheses can finish in different steps. Assume an input x from which we generate an output
sentence y which is a completed hypothesis:
y∗ = argmax
y:comp(y)
Y
i≤|y|
p(yi
|x, y<i) (9)
where y hypothesis. We say that a hypothesis y is completed (comp(y)), if its last word is , i.e.,
comp(y)
def = (y|y| = )
in which case it will not be further expanded.
Given Bi−1 is an b length ordered list, consisting of pairs hy, si of the current prefix y and
the accumulated score (product of probabilities) s , we use beam search to approximately find
the best output y
∗
. Beam search can be defined as an operator topb
that selects the top b
scoring items:
Bi =
b
top{hy
0
◦ yi
, s·p(yi
|x, y)i | hy
0
, si ∈ Bi−1} (10)
Let us define a beam search algorithm that will always return the optimal score complete
hypothesis (modulo beam size) and finish as soon as optimality is reached. Let best≤i be the
best completed hypothesis so far up to step i, i.e.
best≤i
∆=maxn
y ∈ ∪j≤iBj |comp(y)
o
(11)
We update it every time we find a completed hypothesis (if there is none yet, then it remains
undefined). Given that at any step i, if best≤i
is defined and the highest scoring item in the
current beam Bi scores worse than or equal to best≤i
, prove that the current best completed
hypothesis (obtained from best≤i) is the overall highest-probability completed hypothesis and
future steps will be no better than the best completed hypothesis.
4. [Exploding Gradients: 4 points; Extra credit for 4803]
Learning long-term dependencies in recurrent networks suffers from a particular numerical
challenge – gradients propagated over many time-steps tend to either ‘vanish’ (i.e. converge
to 0, frequently) or ‘explode’ (ı.e. diverge to infinity; rarely, but with more damage to the
optimization). To study this problem in a simple setting, consider the following recurrence
relation without any nonlinear activation function or input x:
ht = W>ht−1 (12)
where W is a weight sharing matrix for recurrent relation at any time t. Let λ1, …, λn be the
eigenvalues of the weight matrix W ∈ C
n×n
. Its spectral radius ρ(W) is defined as:
3
ρ(W) = max{|λ1|, …, |λn|} (13)
Assuming the initial hidden state is h0, write the relation between hT and h0 and explain the
role of the eigenvalues of W in determining the ‘vanishing’ or ‘exploding’ property as T 0
5. [Paper Review: 4 points]
Explainability is an increasingly important area of deep learning research today. The following paper, from CVPR 2018, introduces a unique multi-modal approach to explainability by
joint textual rationale generation and attention visualization in the task of visual question answering (VQA). This is a major improvement over pre-existing unimodal explainable models,
with the results showing complementary explanatory strengths from the visual and textual
explanations. The paper can be accessed here.
As in the previous assignments, please limit your reviews to 350 words.
Briefly summarize the key findings, strengths and potential limitations of this work.
2 Coding: Sequence models for image captioning [55 regular points
for CS7643, 45 regular points for CS4803]
The coding part of this assignment will consist of implementation of sequence models for captioning images. To get started, go to https://www.cc.gatech.edu/classes/AY2020/cs7643_spring/
Z3o9P26CwTPZZMDXyWYDj3/hw3/
4