ECE-GY 9123 Homework 5

1. (3 points) Policy gradients. In class we derived a general form of policy gradients. Let us
consider a special case here. Suppose the step size is η. We consider a setting where past actions and
states do not matter; different actions a_i give rise to different rewards R_i.

 

a. Define the mapping π such that π(a_i) = softmax(θ_i) for i = 1, …, k, where k is the
total number of actions and θ_i is a scalar parameter encoding the value of each action.

 

Show that if action a_i is sampled, then the change in the parameters in REINFORCE is
given by:
∆θ_i = ηR_i(1 − π(a_i)).

 

b. Intuitively explain the dynamics of the above gradient updates.
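
As an optional illustration of this setup (not the required derivation), the sketch below runs REINFORCE on a k-armed bandit with a softmax policy over scalar preferences θ_i. The step size η, the number of actions k, and the true_rewards array are hypothetical values chosen only for the example.

import numpy as np

# Minimal sketch of REINFORCE on a k-armed bandit with a softmax policy
# over scalar per-action preferences theta_i (assumed setup for illustration).
rng = np.random.default_rng(0)
k = 4                                          # number of actions (assumed)
eta = 0.1                                      # step size (assumed)
theta = np.zeros(k)                            # one scalar preference per action
true_rewards = np.array([0.2, 0.5, 0.8, 0.3])  # hypothetical mean rewards

def softmax(x):
    z = np.exp(x - x.max())                    # subtract max for numerical stability
    return z / z.sum()

for step in range(2000):
    pi = softmax(theta)                        # pi(a_i) for every action
    a = rng.choice(k, p=pi)                    # sample an action from the policy
    R = true_rewards[a] + 0.1 * rng.standard_normal()  # noisy reward for that action

    # gradient of log pi(a) w.r.t. theta: indicator(a) - pi
    grad_log_pi = -pi
    grad_log_pi[a] += 1.0
    theta += eta * R * grad_log_pi             # sampled action moves by eta*R*(1 - pi(a))

print("learned policy:", softmax(theta))

Running this drives most of the probability mass onto the best arm; inspecting the update to the sampled component reproduces the form asked about in part (a).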

2. (3 points) Designing rewards in Q-learning. Suppose we are trying to solve a maze that contains a goal
and a (stationary) monster in some location, and the objective is to reach the goal in the minimum
number of moves.

We are tasked with designing a suitable reward function for Q-learning.

 

There are two options:
a. We declare a reward of +2 for reaching the goal, -1 for running into a monster, and 0 for
every other move.
b. We declare a reward of +1.5 for reaching the goal, -1.5 for running into a monster, and
-0.5 for every other move.

 

Which of these reward functions might lead to better policies?
(Hint: For a general case, how does the expected discounted return change if a constant offset
is added to all rewards?)
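
As a small numeric illustration of the hint (not part of the assignment), the sketch below computes discounted returns for two hypothetical path lengths. Note that option (b) is exactly option (a) with a constant −0.5 added to every reward; the discount factor γ and the path lengths are assumed values for the example.

# Sketch of how a constant offset c added to every reward changes the
# discounted return: the shift is c * (1 - gamma**T) / (1 - gamma),
# which depends on the episode length T.
gamma = 0.95                                   # assumed discount factor

def discounted_return(rewards, gamma):
    g = 0.0
    for r in reversed(rewards):                # G_t = r_t + gamma * G_{t+1}
        g = r + gamma * g
    return g

for T in (5, 20):                              # two hypothetical path lengths to the goal
    option_a = [0.0] * (T - 1) + [2.0]         # 0 per step, +2 at the goal
    option_b = [r - 0.5 for r in option_a]     # same rewards shifted by -0.5
    print(f"T={T:2d}  option (a) return={discounted_return(option_a, gamma):6.3f}"
          f"  option (b) return={discounted_return(option_b, gamma):6.3f}")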

 

3. (4 points) Open the (incomplete) Jupyter notebook provided as an attachment to this homework
in Google Colab (or other environment of your choice) and complete the missing items.

 

Save your finished notebook in PDF format and upload it, along with your answers to the above theory
questions, in a single PDF.