Homework 3 Energy-Based Models CSCI-GA 2572 Solved


Energy-Based Models
1 Theory (50pt)
1.1 Energy-Based Models Intuition (15pts)

This question tests your intuitive understanding of energy-based models and their properties.
(a) (1pt) How do energy-based models allow for modeling situations where the mapping from input x_i to output y_i is not one-to-one, but one-to-many?

(b) (2pts) How do energy-based models differ from models that output probabilities?

(c) (2pts) How can you use the energy function F_W(x, y) to calculate a probability p(y | x)?

(d) (2pts) What are the roles of the loss function and energy function?

(e) (2pts) What problems can be caused by using only positive examples for energy (pushing down the energy of correct inputs only)? How can this be avoided?

(f) (2pts) Briefly explain the three methods that can be used to shape the energy function.

(g) (2pts) Provide an example of a loss function that uses negative examples. The format should be as follows: ℓ_example(x, y, W) = F_W(x, y).

(h) (2pts) Say we have an energy function F(x, y) with images x and classifications y for those images. Write down the mathematical expression for doing inference given an input x. Now say we have a latent variable z, and our energy is G(x, y, z). What is the expression for doing inference then?

1.2 Negative log-likelihood loss (20 pts)

Let's consider an energy-based model that we are training to do classification of an input into one of n classes. F_W(x, y) is the energy of input x and class y. We consider n classes: y ∈ {1, …, n}.
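
For concreteness, here is a minimal sketch of what such an energy function might look like in code, assuming F_W(x, ·) is parameterized by a small network that outputs one energy per class (the architecture, dimensions, and names below are illustrative, not part of the assignment):

```python
import torch
import torch.nn as nn


class ClassEnergy(nn.Module):
    """Toy energy function: forward(x) returns the vector [F_W(x, 1), ..., F_W(x, n)]."""

    def __init__(self, in_dim: int, n_classes: int):
        super().__init__()
        # W stands for all trainable parameters of this module.
        self.net = nn.Sequential(
            nn.Linear(in_dim, 64),
            nn.ReLU(),
            nn.Linear(64, n_classes),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # One energy per class; lower energy means a better (x, y) match.
        return self.net(x)
```

Here a low value of F_W(x, y) indicates that x and y are compatible, consistent with question 1.1(e) above, which talks about pushing down the energy of correct inputs.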
(i) (2pts) For a given input x, write down an expression for a Gibbs distribution
over labels y that this energy-based model specifies. Use β for the constant
multiplier.

(ii) (5pts) Let's say that for a particular data sample x we have the label y. Give the expression for the negative log-likelihood loss, i.e. the negative log-likelihood of the correct label (show a step-by-step derivation of the loss function from the expression in the previous subproblem). For easier calculations in the following subproblem, multiply the loss by 1/β (a small numerical sketch of these quantities is given after part (iv)).

(iii) (8pts) Now, derive the gradient of that expression with respect to W (just providing the final expression is not enough). Why can this gradient be intractable to compute, and how can we get around the intractability?

(iv) (5pts) Explain why the negative log-likelihood loss pushes the energy of the correct example to negative infinity and the energy of all others to positive infinity, no matter how close the two examples are, resulting in an energy surface with very sharp edges in the case of continuous y (this is usually not an issue for discrete y because there is no distance measure between different classes).
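
As a numerical companion to this subsection, here is a hedged sketch of the Gibbs distribution over labels and the per-sample negative log-likelihood, assuming β is a scalar constant and the energies for each input are stored as a row of a tensor (the function name and the use of PyTorch are assumptions, not requirements of the assignment):

```python
import torch
import torch.nn.functional as F


def gibbs_and_nll(energies: torch.Tensor, y: torch.Tensor, beta: float = 1.0):
    """energies: (batch, n_classes) tensor of F_W(x, k) for every class k;
    y: (batch,) tensor of correct class indices."""
    # Gibbs distribution over labels: lower energy -> higher probability.
    log_p = F.log_softmax(-beta * energies, dim=1)
    # Negative log-likelihood of the correct label for each sample.
    nll = -log_p[torch.arange(len(y)), y]
    return log_p.exp(), nll.mean()
```

As β grows, the distribution concentrates on the lowest-energy class, which is one way to see the connection between probabilistic inference and energy minimization.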

1.3 Comparing Contrastive Loss Functions (15pts)

In this problem, we're going to compare a few contrastive loss functions. We are going to look at the behavior of their gradients and understand what each loss function is used for. In the following subproblems, m is a margin, m ∈ R, x is the input, y is the correct label, and ȳ is an incorrect label. A short code sketch of the losses defined below is given after part (d).

Define each loss in the following format: ℓ_example(x, y, ȳ, W) = F_W(x, y).
(a) (3pts) The simple loss function is defined as follows:

ℓ_simple(x, y, ȳ, W) = [F_W(x, y)]^+ + [m − F_W(x, ȳ)]^+

Assuming we know the derivative ∂F_W(x, y)/∂W for any x, y, give an expression for the partial derivative of ℓ_simple with respect to W.
(b) (3pts) The log loss is defined as follows:

ℓ_log(x, y, ȳ, W) = log(1 + e^(F_W(x, y) − F_W(x, ȳ)))

Assuming we know the derivative ∂F_W(x, y)/∂W for any x, y, give an expression for the partial derivative of ℓ_log with respect to W.
(c) (3pts) The square-square loss is defined as follows:

ℓ_square-square(x, y, ȳ, W) = ([F_W(x, y)]^+)^2 + ([m − F_W(x, ȳ)]^+)^2

Assuming we know the derivative ∂F_W(x, y)/∂W for any x, y, give an expression for the partial derivative of ℓ_square-square with respect to W.
(d) (6pts) Comparison.

(i) (2pts) Explain how NLL loss is different from the three losses above.

(ii) (2pts) The hinge loss [F_W(x, y) − F_W(x, ȳ) + m]^+ has a margin parameter m, which gives 0 loss when the positive and negative examples have energies that are at least m apart. The log loss is sometimes called a "soft-hinge" loss. Why? What is the advantage of using a soft-hinge loss?

(iii) (2pts) How are the simple loss and square-square loss different from
the hinge/log loss? In what situations would you use the simple loss,
and in what situations would you use the square-square loss?
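
For reference when answering the comparison above, here is a hedged sketch that transcribes the four losses discussed in this section (simple, log, square-square, and hinge) exactly as defined, assuming the energies F_W(x, y) and F_W(x, ȳ) have already been computed as tensors e_pos and e_neg; the helper names are illustrative:

```python
import torch.nn.functional as F

# e_pos = F_W(x, y) for the correct label, e_neg = F_W(x, ybar) for an incorrect one.

def simple_loss(e_pos, e_neg, m):
    # [F_W(x, y)]^+ + [m - F_W(x, ybar)]^+
    return F.relu(e_pos) + F.relu(m - e_neg)

def log_loss(e_pos, e_neg):
    # log(1 + exp(F_W(x, y) - F_W(x, ybar))), i.e. the softplus of the energy gap
    return F.softplus(e_pos - e_neg)

def square_square_loss(e_pos, e_neg, m):
    # ([F_W(x, y)]^+)^2 + ([m - F_W(x, ybar)]^+)^2
    return F.relu(e_pos) ** 2 + F.relu(m - e_neg) ** 2

def hinge_loss(e_pos, e_neg, m):
    # [F_W(x, y) - F_W(x, ybar) + m]^+
    return F.relu(e_pos - e_neg + m)
```

Each loss only needs F_W evaluated at the given correct and incorrect labels, which is what makes these losses contrastive.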

2 Implementation (50pt)

Please add your solutions to the notebook hw3_impl.ipynb. Please use your NYU account to access the notebook. The notebook contains parts marked as TODO, where you should put your code or explanations.

The notebook is a Google Colab notebook; you should copy it to your Drive, add your solutions, and then download and submit it to NYU Classes. You're also free to run it on any other machine, as long as the version you send us can be run on Google Colab.