
CSCI-GA 2572 Homework 4: Transformer

1 Theory (50pt)
1.1 Attention (13pts)

This question tests your intuitive understanding of attention and its properties.
(a) (1pt) Given queries Q ∈ R^{d×n}, keys K ∈ R^{d×m}, and values V ∈ R^{t×m}, what is the output H of standard dot-product attention? (You may use the softargmax_β function directly; it is applied to each column of its matrix argument.)
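For concreteness, here is a minimal NumPy sketch of this operation (an illustration, not the expected written answer), assuming the column convention implied by the shapes above, i.e. H = V · softargmax_β(KᵀQ):

```python
import numpy as np

def softargmax(S, beta=1.0):
    """Column-wise softargmax_beta: exponentiate, then normalize each column."""
    E = np.exp(beta * S - (beta * S).max(axis=0, keepdims=True))  # subtract max for stability
    return E / E.sum(axis=0, keepdims=True)

def attention(Q, K, V, beta=1.0):
    """Dot-product attention with Q: (d, n), K: (d, m), V: (t, m) -> H: (t, n)."""
    A = softargmax(K.T @ Q, beta)  # (m, n): one attention distribution per query
    return V @ A                   # (t, n): each output mixes the value columns

d, m, n, t = 8, 5, 3, 4
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(d, n)), rng.normal(size=(d, m)), rng.normal(size=(t, m))
H = attention(Q, K, V, beta=1 / np.sqrt(d))
assert H.shape == (t, n)
```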

(b) (2pts) Explain how the scale β influences the output of the attention operation. What value of β is convenient to use?
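To build intuition (using hypothetical scores, not part of the question), this snippet shows how β controls the sharpness of the attention weights:

```python
import numpy as np

s = np.array([1.0, 2.0, 3.0])        # hypothetical scores k_i . q for one query
for beta in (0.1, 1.0, 10.0):
    e = np.exp(beta * s)
    print(beta, np.round(e / e.sum(), 3))
# beta -> 0 gives near-uniform weights; a large beta approaches a one-hot (argmax) vector
```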

(c) (3pts) One advantage of the attention operation is that it can easily pass a value vector v through to the output h unchanged. Explain in what situation the output preserves the value vectors, and what the scale β should be if we just want the attention operation to preserve value vectors. Which of the four types of attention are we referring to? How can this be done with fully connected architectures? (Both limiting regimes are sketched numerically after part (d).)

(d) (3pts) On the other hand, the attention operation can also blend different value vectors v to generate a new output h. Explain in what situation the output is a spread-out version of the value vectors, and what the scale β should be if we want the attention operation to diffuse as much as possible. Which of the four types of attention are we referring to? How can this be done with fully connected architectures?
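For parts (c) and (d), the following toy sketch (assuming orthonormal keys and a query that exactly matches one key) contrasts the two limiting regimes of β:

```python
import numpy as np

K = np.eye(4)                          # orthonormal keys as columns
V = np.arange(16.0).reshape(4, 4)      # value vectors as columns
q = K[:, [2]]                          # query aligned with key 2

def attend(q, beta):
    E = np.exp(beta * (K.T @ q))
    return V @ (E / E.sum(axis=0, keepdims=True))

print(attend(q, beta=100.0).ravel())   # ~V[:, 2]: one value vector is preserved
print(attend(q, beta=0.0).ravel())     # mean of all value columns: maximally diffused
```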

(e) (2pts) If we apply a small perturbation to one of the keys k (you may assume the perturbation is zero-mean Gaussian with small variance, so the new key is k̂ = k + ϵ), how will the output H change? (See the numeric sketch after part (f).)

(f) (2pts) If we apply a large perturbation that elongates one key, so that k̂ = αk for α > 1, how will the output H change?
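Parts (e) and (f) can also be probed numerically; this is a sketch under assumed shapes and perturbation sizes, not a substitute for the written argument:

```python
import numpy as np

def attn(Q, K, V, beta):
    E = np.exp(beta * (K.T @ Q))
    return V @ (E / E.sum(axis=0, keepdims=True))

d, m, n, t = 8, 5, 3, 4
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(d, n)), rng.normal(size=(d, m)), rng.normal(size=(t, m))
beta = 1 / np.sqrt(d)
H = attn(Q, K, V, beta)

K_eps = K.copy()
K_eps[:, 0] += 0.01 * rng.normal(size=d)      # (e) small additive noise on one key
print(np.linalg.norm(attn(Q, K_eps, V, beta) - H))

K_alpha = K.copy()
K_alpha[:, 0] *= 4.0                          # (f) elongate one key: k_hat = alpha * k
print(np.linalg.norm(attn(Q, K_alpha, V, beta) - H))
```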

1.2 Multi-headed Attention (3pts)

This question tests your intuitive understanding of multi-headed attention and its properties.
(a) (1pt) Given queries Q ∈ R^{d×n}, keys K ∈ R^{d×m}, and values V ∈ R^{t×m}, what is the output H of standard multi-headed scaled dot-product attention? Assume we have h heads.
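Below is a minimal sketch of one simplified formulation that splits the coordinates evenly across heads; the standard formulation in Vaswani et al. (2017) instead uses learned per-head projections W_i^Q, W_i^K, W_i^V and a final output projection W^O:

```python
import numpy as np

def softargmax(S, beta):
    E = np.exp(beta * S - (beta * S).max(axis=0, keepdims=True))
    return E / E.sum(axis=0, keepdims=True)

def multi_head_attention(Q, K, V, h):
    """Q: (d, n), K: (d, m), V: (t, m); assumes d and t are divisible by h."""
    d, t = Q.shape[0], V.shape[0]
    dh, th = d // h, t // h
    heads = []
    for i in range(h):
        A = softargmax(K[i*dh:(i+1)*dh].T @ Q[i*dh:(i+1)*dh], beta=1/np.sqrt(dh))
        heads.append(V[i*th:(i+1)*th] @ A)     # (t/h, n): output of head i
    return np.concatenate(heads, axis=0)       # (t, n): heads concatenated

d, m, n, t, h = 8, 5, 3, 8, 2
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(d, n)), rng.normal(size=(d, m)), rng.normal(size=(t, m))
assert multi_head_attention(Q, K, V, h).shape == (t, n)
```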

(b) (2pts) Is there anything similar to multi-headed attention for convolutional networks? Explain why you think they are similar. (Hint: read the Conv1d documentation from PyTorch: link)

1.3 Self Attention (11pts)

This question tests your intuitive understanding of self-attention and its properties.
(a) (2pts) Given an input C ∈ R^{e×n}, what are the queries Q, the keys K, the values V, and the output H of standard multi-headed scaled dot-product self-attention? Assume we have h heads. (You may name and define the weight matrices yourself.)
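A single-head sketch of the construction, with the weight-matrix names (W_Q, W_K, W_V) chosen here purely for illustration; the multi-head version repeats this per head:

```python
import numpy as np

rng = np.random.default_rng(0)
e, n, d, t = 16, 10, 8, 8
C = rng.normal(size=(e, n))                  # input: one column per location

W_Q = rng.normal(size=(d, e))                # illustrative learned projections
W_K = rng.normal(size=(d, e))
W_V = rng.normal(size=(t, e))

Q, K, V = W_Q @ C, W_K @ C, W_V @ C          # all three derived from the same input
E = np.exp((K.T @ Q) / np.sqrt(d))
A = E / E.sum(axis=0, keepdims=True)         # column-wise softargmax
H = V @ A                                    # (t, n): one output per input location
assert H.shape == (t, n)
```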

(b) (2pts) Explain when we need the positional encoding for self-attention and
when we don’t. (You can read about it at link)

(c) (2pts) Show us one situation in which the self-attention layer behaves like an identity layer or a permutation layer.

(d) (2pts) Show us one situation in which the self-attention layer behaves like a “running” linear layer (one that applies the same linear projection to each location). What is the proper name for a “running” linear layer?
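As a sanity check for this part (assuming the “running” linear layer means the same projection applied at every location), a kernel-size-1 Conv1d in PyTorch computes exactly that:

```python
import torch

lin = torch.nn.Linear(16, 8, bias=False)
conv = torch.nn.Conv1d(16, 8, kernel_size=1, bias=False)
conv.weight.data.copy_(lin.weight.data.unsqueeze(-1))   # share the same weights

x = torch.randn(1, 16, 10)                              # (batch, channels, length)
y_conv = conv(x)                                        # conv slid along the length
y_lin = lin(x.transpose(1, 2)).transpose(1, 2)          # same projection per location
assert torch.allclose(y_conv, y_lin, atol=1e-5)
```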

(e) (3pts) Show us one situation in which the self-attention layer behaves like a convolution layer with a kernel size larger than 1. You may assume we use positional encoding.

1.4 Transformer (15pts)

Read the original paper on the Transformer model: “Attention is All You Need”
by Vaswani et al. (2017).
(a) (3pts) Explain the primary differences between the Transformer architecture and previous sequence-to-sequence models (such as RNNs and LSTMs).
(b) (3pts) Explain the concept of self-attention and its importance in the Transformer model.
(c) (3pts) Describe the multi-head attention mechanism and its benefits.
(d) (3pts) Explain the feed-forward neural networks used in the model and
their purpose.
(e) (3pts) Describe the layer normalization technique and its use in the Transformer architecture.

1.5 Vision Transformer (8pts)

Read the Vision Transformer paper: “An Image is Worth 16 × 16 Words: Transformers for Image Recognition at Scale”.
(a) (2pts) What is the key difference between the Vision Transformer (ViT)
and traditional convolutional neural networks (CNNs) in terms of handling
input images? Can you spot a convolution layer in the ViT architecture?
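A hint for spotting the convolution (a sketch with assumed ViT-Base hyperparameters: 16 × 16 patches and 768-dimensional embeddings): the patch-embedding step can be written as a strided Conv2d.

```python
import torch

# Patchify + linearly embed, expressed as a convolution with kernel = stride = patch size.
patch_embed = torch.nn.Conv2d(3, 768, kernel_size=16, stride=16)
img = torch.randn(1, 3, 224, 224)
tokens = patch_embed(img).flatten(2).transpose(1, 2)   # (1, 196, 768): one token per patch
assert tokens.shape == (1, 196, 768)
```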

(b) (2pts) Explain the differences between the Vision Transformer and the
Transformer introduced in the original paper.

(c) (2pts) What is the role of positional embeddings in the Vision Transformer
model, and how do they differ from positional encodings used in the original
Transformer architecture?

(d) (2pts) How does the Vision Transformer model generate the final classification output? Describe the process and components involved in this step.

2 Implementation (50pt)

Please add your solutions to the notebook HW4-VIT-Student.ipynb. Please use your NYU account to access the notebook. The notebook contains parts marked as TODO, where you should put your code or explanations.

The notebook is a Google Colab notebook: you should copy it to your Drive, add your solutions, and then download and submit it to NYU Classes. You are also free to run it on any other machine, as long as the version you send us can be run on Google Colab.