## Description

1 Recurrent Neural Network

1. [Vanilla RNN for parity function: 4 points]

Let us define a sequence parity function as a function that takes in a sequence of binary inputs and returns a sequence indicating the parity of the number of 1's in the input so far; specifically, if the number of 1's seen up to time $t$ is even it returns 1, and 0 if it is odd. For example, given the input sequence [0, 1, 0, 1, 1, 0], the parity sequence is [1, 0, 0, 1, 0, 0]. Let $x_t$ denote the input at time $t$ and $y_t$ the boolean output of this parity function at time $t$.

Design a vanilla recurrent neural network to implement the parity function. Your implementation should include the equations of your hidden units and the details of the activations for each input $x_t$ (0 or 1).

Hint: Recall that we implemented AND and OR functions in Problem Set 2, which may be useful here.
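For concreteness, the target function itself (what your RNN must reproduce — this is not the RNN) can be sketched in a few lines of Python:

```python
def parity_sequence(bits):
    """Return y_t = 1 if the number of 1's in bits[:t+1] is even, else 0."""
    ones = 0
    out = []
    for x in bits:
        ones += x
        out.append(1 if ones % 2 == 0 else 0)
    return out

print(parity_sequence([0, 1, 0, 1, 1, 0]))  # [1, 0, 0, 1, 0, 0]
```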

2. [LSTM for parity function: 4 points]

Let us recall the different gates in an LSTM network. The first gate is the "forget gate layer":

$$f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f) \tag{1}$$

where $f_t$ is the output of the forget gate, $W_f$ is the weight matrix, $h_{t-1}$ is the hidden state at step $t-1$, $x_t$ is the current input, and $b_f$ is the bias.

Next we have the "input gate layer":

$$i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i) \tag{2}$$

$$\tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C) \tag{3}$$

where $i_t$ decides which values we will update and $\tilde{C}_t$ are the new candidate values that could be added to the cell state. Next we have the new cell state:

$$C_t = f_t * C_{t-1} + i_t * \tilde{C}_t \tag{4}$$

Finally, we have the output and hidden state:

$$o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o) \tag{5}$$

$$h_t = o_t * \tanh(C_t) \tag{6}$$
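Equations (1)–(6) can be collected into a single step function. The NumPy sketch below uses the step-function activations defined in equations (7) and (8) below; the weight values are deliberately left as parameters, since choosing them so that $C_t$ tracks parity is exactly what this question asks you to do:

```python
import numpy as np

def step_sigmoid(x):
    # Equation (7): hard sigmoid -- 1 if x > 0, else 0
    return (x > 0).astype(float)

def step_tanh(x):
    # Equation (8): sign function -- 1 if x > 0, 0 if x == 0, -1 if x < 0
    return np.sign(x)

def lstm_step(x_t, h_prev, C_prev, Wf, bf, Wi, bi, WC, bC, Wo, bo):
    """One LSTM step following equations (1)-(6); weights act on [h_{t-1}, x_t]."""
    z = np.concatenate([h_prev, x_t])     # [h_{t-1}, x_t]
    f_t = step_sigmoid(Wf @ z + bf)       # (1) forget gate
    i_t = step_sigmoid(Wi @ z + bi)       # (2) input gate
    C_tilde = step_tanh(WC @ z + bC)      # (3) candidate values
    C_t = f_t * C_prev + i_t * C_tilde    # (4) new cell state
    o_t = step_sigmoid(Wo @ z + bo)       # (5) output gate
    h_t = o_t * step_tanh(C_t)            # (6) new hidden state
    return h_t, C_t
```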

Design an LSTM network for the bit parity problem from Question 1. Specifically, provide values for $W_f$, $b_f$, $W_i$, $b_i$, $W_C$, $b_C$, $W_o$ and $b_o$ such that the cell state $C_t$ stores the parity of the bit string. Please mention any assumptions you make. For this problem, you may assume the following for the sigmoid and tanh functions:

$$\sigma(x) = \begin{cases} 1, & \text{if } x > 0 \\ 0, & \text{otherwise} \end{cases} \tag{7}$$

$$\tanh(x) = \begin{cases} 1, & \text{if } x > 0 \\ 0, & \text{if } x = 0 \\ -1, & \text{if } x < 0 \end{cases} \tag{8}$$


Hint: Recall that the XOR of $x$ and $y$ can be represented as $(x \wedge \bar{y}) \vee (\bar{x} \wedge y)$. Think about how you can leverage this information for equation (4).
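The identity in the hint is easy to sanity-check against Python's built-in XOR:

```python
def xor(x, y):
    """XOR via the identity (x AND (NOT y)) OR ((NOT x) AND y)."""
    return int((x and not y) or (not x and y))

# agrees with Python's bitwise XOR on all four input pairs
assert all(xor(x, y) == (x ^ y) for x in (0, 1) for y in (0, 1))
```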

3. [When to stop in beam search?: 4 points]

Beam search is a technique widely used to improve the output text quality of neural text generation, but it is difficult to decide when to stop beam search so as to obtain optimality, because hypotheses can finish at different steps. Assume an input $x$ from which we generate an output sentence $y$; the optimal completed hypothesis is:

$$y^* = \operatorname*{argmax}_{y:\, comp(y)} \prod_{i \le |y|} p(y_i \mid x, y_{<i}) \tag{9}$$

We say that a hypothesis $y$ is completed ($comp(y)$) if its last word is the end-of-sequence token, i.e.,

$$comp(y) \stackrel{\text{def}}{=} (y_{|y|} = \texttt{</s>})$$

in which case it will not be further expanded.

Given that $B_{i-1}$ is a $b$-length ordered list consisting of pairs $\langle y, s \rangle$ of the current prefix $y$ and the accumulated score (product of probabilities) $s$, we use beam search to approximately find the best output $y^*$. Beam search can be defined as an operator $\mathrm{top}^{b}$ that selects the top $b$ scoring items:

$$B_i = \mathrm{top}^{b} \left\{ \langle y' \circ y_i,\; s \cdot p(y_i \mid x, y') \rangle \;\middle|\; \langle y', s \rangle \in B_{i-1} \right\} \tag{10}$$

Let us define a beam search algorithm that always returns the optimal-score completed hypothesis (modulo beam size) and finishes as soon as optimality is reached. Let $best_{\le i}$ be the best completed hypothesis so far, up to step $i$, i.e.,

$$best_{\le i} \stackrel{\Delta}{=} \max \left\{\, y \in \cup_{j \le i} B_j \;\middle|\; comp(y) \,\right\} \tag{11}$$

We update it every time we find a completed hypothesis (if there is none yet, it remains undefined). Prove that at any step $i$, if $best_{\le i}$ is defined and the highest-scoring item in the current beam $B_i$ scores worse than or equal to $best_{\le i}$, then the current best completed hypothesis (obtained from $best_{\le i}$) is the overall highest-probability completed hypothesis, and future steps will be no better than it.
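To make the stopping condition concrete, here is a toy sketch of beam search with this optimality check. The `next_probs` interface and the toy model are illustrative assumptions, and this simplified variant moves completed hypotheses out of the beam rather than carrying them in $B_i$:

```python
def beam_search_with_stop(next_probs, b, eos, max_len=20):
    """Beam search that stops as soon as the best completed hypothesis
    scores at least as high as every item in the current beam.

    next_probs(prefix) -> {token: probability}  (assumed toy interface);
    scores are products of probabilities, as in equation (9).
    """
    beam = [((), 1.0)]   # B_0: empty prefix with score 1
    best = None          # best completed hypothesis found so far
    for _ in range(max_len):
        candidates = []
        for prefix, s in beam:
            for tok, p in next_probs(prefix).items():
                candidates.append((prefix + (tok,), s * p))
        candidates.sort(key=lambda ys: -ys[1])
        beam = []
        for y, s in candidates:
            if y[-1] == eos:            # completed: never expanded further
                if best is None or s > best[1]:
                    best = (y, s)
            else:
                beam.append((y, s))
            if len(beam) == b:
                break
        # Stopping check: scores only shrink as prefixes grow (each p <= 1),
        # so once best beats the whole beam, no future step can do better.
        if best is not None and (not beam or best[1] >= beam[0][1]):
            return best
    return best

# Toy model: 'a' then '</s>' is the highest-probability completion.
def next_probs(prefix):
    if len(prefix) == 0:
        return {'a': 0.6, 'b': 0.4}
    return {'</s>': 0.9, 'a': 0.1}

print(beam_search_with_stop(next_probs, b=2, eos='</s>'))  # score ~ 0.54
```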

4. [Exploding Gradients: 4 points; Extra credit for 4803]

Learning long-term dependencies in recurrent networks suffers from a particular numerical challenge: gradients propagated over many time steps tend either to 'vanish' (i.e. converge to 0; frequently) or to 'explode' (i.e. diverge to infinity; rarely, but with more damage to the optimization). To study this problem in a simple setting, consider the following recurrence relation without any nonlinear activation function or input $x$:

$$h_t = W^\top h_{t-1} \tag{12}$$

where $W$ is a weight matrix shared across the recurrence at every time $t$. Let $\lambda_1, \dots, \lambda_n$ be the eigenvalues of the weight matrix $W \in \mathbb{C}^{n \times n}$. Its spectral radius $\rho(W)$ is defined as:


$$\rho(W) = \max\{|\lambda_1|, \dots, |\lambda_n|\} \tag{13}$$

Assuming the initial hidden state is $h_0$, write the relation between $h_T$ and $h_0$, and explain the role of the eigenvalues of $W$ in determining the 'vanishing' or 'exploding' behavior as $T \to \infty$.
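As a quick numerical illustration (not part of the required answer), iterating equation (12) with a diagonal $W$ makes the eigenvalue behavior directly visible, since the diagonal entries are the eigenvalues:

```python
import numpy as np

def iterate(W, h0, T):
    """Apply h_t = W^T h_{t-1} (equation (12)) for T steps."""
    h = h0.copy()
    for _ in range(T):
        h = W.T @ h
    return h

# Diagonal W exposes the eigenvalues: 0.5 (contracting) and 2.0 (expanding).
W = np.diag([0.5, 2.0])
hT = iterate(W, np.ones(2), T=10)
# The component along the |lambda| < 1 eigenvector vanishes (0.5**10),
# while the |lambda| > 1 component explodes (2.0**10 = 1024).
print(hT)  # roughly [0.001, 1024]
```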

5. [Paper Review: 4 points]

Explainability is an increasingly important area of deep learning research today. The following paper, from CVPR 2018, introduces a unique multi-modal approach to explainability: joint textual rationale generation and attention visualization in the task of visual question answering (VQA). This is a major improvement over pre-existing unimodal explainable models, with the results showing complementary explanatory strengths from the visual and textual explanations. The paper can be accessed here.

As in the previous assignments, please limit your reviews to 350 words.

Briefly summarize the key findings, strengths and potential limitations of this work.

2 Coding: Sequence models for image captioning [55 regular points for CS7643, 45 regular points for CS4803]

The coding part of this assignment consists of implementing sequence models for captioning images. To get started, go to https://www.cc.gatech.edu/classes/AY2020/cs7643_spring/Z3o9P26CwTPZZMDXyWYDj3/hw3/
