## Description

### 1 Optimal Policy and Value Function [15 points]

Consider the simple 2-state MDP shown in Figure 1, with $S = \{S_1, S_2\}$. From each state, there are two available actions $A = \{\text{stay}, \text{go}\}$. The reward received for taking an action at a state is shown in the figure for each $(s, a)$ pair. Also, taking action go from state $S_2$ ends the episode. All transitions are deterministic.

Figure 1: 2-state MDP

(a) Consider a policy that always chooses the stay action at every state. Compute the sum of discounted rewards obtained by this policy starting at state $S_1$, assuming a discount factor $\gamma$, i.e. compute $\sum_{t=0}^{\infty} \gamma^t \, r_t(s_t, a_t)$. [4 points]

(b) What is the optimal policy? Write it as a tuple $(a_1, a_2)$ of the optimal actions at $(S_1, S_2)$ respectively. Compute the sum of discounted rewards obtained by this policy, given that the start state is $S_1$, with discount $\gamma$. [4 points]

(c) Recall that one update of the value iteration algorithm performs the following operation, where $r(s, a)$ denotes the reward for taking action $a$ at state $s$. For each $s \in S$:

$$V^{i+1}(s) \leftarrow \max_a \sum_{s'} p(s' \mid s, a)\,\big[r(s, a) + \gamma V^i(s')\big] \qquad (1)$$

Note: Since taking action go at $s_2$ terminates the episode, the value of taking the go action at $s_2$ is just $r(s_2, \text{go})$ instead of $r(s_2, \text{go}) + \gamma V(s_1)$ (since you don't care about future values after the episode has ended). Thus the update for $V(s_2)$ will look like:

$$V^{i+1}(s_2) \leftarrow \max\big(r(s_2, \text{stay}) + \gamma V^i(s_2),\; r(s_2, \text{go})\big) \qquad (2)$$

The value function $V$ is stored as an array of size $|S|$ ($|S| = 2$ in our case). For example, the array $V^0 = [0, 1]$ denotes that $V^0(s_1) = 0$ and $V^0(s_2) = 1$. Starting with an initial $V^0 = [0, 0]$, perform the value iteration update to compute $V^1$, $V^2$, $V^3$ assuming a discount factor $\gamma = 1$. What is the optimal $V^*$? [7 points]
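For reference, a minimal value-iteration sketch in Python for this 2-state MDP is shown below. The rewards from Figure 1 are not reproduced in this text, so the entries of `REWARDS` are placeholders, and the transition structure (stay keeps the current state, go moves $s_1 \to s_2$, go from $s_2$ terminates) is assumed from the problem description.

```python
import numpy as np

GAMMA = 1.0
# Placeholder rewards -- substitute the (s, a) rewards from Figure 1.
REWARDS = {
    ("s1", "stay"): 0.0,
    ("s1", "go"):   0.0,
    ("s2", "stay"): 0.0,
    ("s2", "go"):   0.0,
}

def vi_update(V):
    """One sweep of Eqs. (1)-(2); V = [V(s1), V(s2)].
    Assumed deterministic transitions: stay keeps the current state,
    go moves s1 -> s2, and go from s2 ends the episode."""
    v1 = max(REWARDS[("s1", "stay")] + GAMMA * V[0],
             REWARDS[("s1", "go")] + GAMMA * V[1])
    v2 = max(REWARDS[("s2", "stay")] + GAMMA * V[1],
             REWARDS[("s2", "go")])  # no bootstrap term: the episode has ended
    return np.array([v1, v2])

V = np.zeros(2)          # V^0 = [0, 0]
for i in range(3):       # compute V^1, V^2, V^3
    V = vi_update(V)
    print(f"V^{i + 1} =", V)
```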

### 2 Value Iteration Convergence [25 points + 5 bonus points]

In part (c) of the previous question, we computed $V^i$ for a few updates of value iteration and also $V^*$, the optimal value function. What happened to the "error" in the value estimate at every update, i.e. did $|V^i(s) - V^*(s)|$ ever increase for a particular state $s$? The answer is yes: the value estimate for any $s$ may fluctuate before eventually converging to $V^*(s)$. If this is the case, do we have any hope that the value iteration algorithm will converge to $V^*$? In this question, we will prove that value iteration indeed converges due to the monotonic decrease in a different error, the maximum error over all states $\max_{s \in S} |V^i(s) - V^*(s)|$, i.e. the $L_\infty$ norm $\|V^i - V^*\|_\infty$.

(a) For $V^1$, $V^2$, $V^3$ obtained in 1(c), compute $\|V^i - V^*\|_\infty$ and verify that this error decreased monotonically. [5 points]
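The $L_\infty$ error here is just the largest per-state gap; a small helper (shown with made-up numbers, not the answers to 1(c)) makes the computation explicit:

```python
import numpy as np

def linf_error(V, V_star):
    """L-infinity distance max_s |V(s) - V*(s)| between a value estimate and the optimum."""
    return np.max(np.abs(np.asarray(V) - np.asarray(V_star)))

# Example with made-up numbers:
print(linf_error([0.0, 1.0], [2.0, 3.0]))  # -> 2.0
```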

(b) Let $T : \mathbb{R}^{|S|} \to \mathbb{R}^{|S|}$ be the function that takes the vector $V^i$ as input and applies the value iteration update to produce $V^{i+1}$. We will show that for any two vectors $V, V' \in \mathbb{R}^{|S|}$, this operator contracts the $L_\infty$ distance between them by a factor of $\gamma$.

Prove that $\|T(V) - T(V')\|_\infty \le \gamma \|V - V'\|_\infty$ for any $V, V' \in \mathbb{R}^{|S|}$. [10 points]
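As a quick numerical sanity check (not a substitute for the proof), one can verify the contraction property on a randomly generated MDP; the state/action counts, rewards, and transition probabilities below are all made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
nS, nA, gamma = 5, 3, 0.9

# Random MDP: rewards R[s, a] and transition probabilities P[s, a, s'].
R = rng.random((nS, nA))
P = rng.random((nS, nA, nS))
P /= P.sum(axis=2, keepdims=True)

def bellman_opt(V):
    """Value-iteration operator T from Eq. (1)."""
    Q = R + gamma * (P @ V)   # Q[s, a] = r(s, a) + gamma * sum_s' p(s'|s, a) V(s')
    return Q.max(axis=1)

for _ in range(5):
    V, W = rng.normal(size=nS), rng.normal(size=nS)
    lhs = np.max(np.abs(bellman_opt(V) - bellman_opt(W)))
    rhs = gamma * np.max(np.abs(V - W))
    print(f"||T(V) - T(W)||_inf = {lhs:.4f}  <=  gamma * ||V - W||_inf = {rhs:.4f}")
```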

(c) Now consider the sequence of value functions $\{V^n\}$ obtained by iteratively performing the value iteration update as per Eq. 1. This sequence has the interesting property of being a Cauchy sequence, i.e. the elements of the sequence become arbitrarily close to each other as the sequence progresses. Formally, a sequence is Cauchy w.r.t. the $L_\infty$ norm if for every positive number $\epsilon$, there is a positive integer $N$ s.t. for all positive integers $m, n$ greater than $N$, $\|x_m - x_n\|_\infty < \epsilon$. Equipped with this fact, show that the error of $V^{n+1}$ w.r.t. the optimal value function can be bounded as: $\forall \epsilon > 0$, $\exists N > 0$ s.t. $\forall n > N$,

$$\left\|V^{n+1} - V^*\right\|_\infty \le \frac{\gamma}{1 - \gamma}\,\epsilon \qquad (3)$$

[10 points]

Hint: First try to prove that $\left\|V^{n+1} - V^*\right\|_\infty \le \frac{\gamma}{1 - \gamma} \left\|V^n - V^{n+1}\right\|_\infty$, then apply the Cauchy sequence condition on $V^n$.

Discussion: Given Equation 3, we now have a guarantee that value iteration converges to $V^*$. However, we made the assumption that $\{V^n\}$ is a Cauchy sequence. As a bonus question below, we will prove that the contraction condition from (b) is sufficient to conclude that $\{V^n\}$ is indeed a Cauchy sequence.

(d) [BONUS] Given a function $T : \mathbb{R}^n \to \mathbb{R}^n$ such that $\|T(x_1) - T(x_2)\|_\infty \le \alpha \|x_1 - x_2\|_\infty$ with $\alpha \in [0, 1)$, prove first that $T$ has a fixed point, i.e. $\exists x^* : T(x^*) = x^*$, and second that the fixed point of $T$ is unique. [5 bonus points]
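The existence half of the argument is usually shown by iterating $T$ from an arbitrary starting point; a tiny numerical illustration with a made-up contraction (not the Bellman operator) shows the iterates piling up on the fixed point:

```python
import numpy as np

# A made-up contraction on R^2 with alpha = 0.5: T(x) = 0.5 * x + c.
# Its unique fixed point is x* = 2c (solve x = 0.5 x + c).
c = np.array([1.0, -3.0])
T = lambda x: 0.5 * x + c

x = np.array([100.0, 100.0])   # arbitrary starting point
for _ in range(20):
    x = T(x)
print("iterate:", x, " fixed point:", 2 * c)
```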

### 3 Learning the Model [20 bonus points]

This is a challenging question; all sub-parts are for bonus credit.

While value iteration (VI) allows us to find the optimal policy, performing VI updates (Eq. 1) requires the transition function $T(s, a) := p(s' \mid s, a)$¹ and the reward function $R(s, a)$. In many practical situations, these quantities are unknown. However, we can step through the environment and observe transitions to collect data of the form $(s, a, s', r)$. Using this data, if we can obtain an estimate of the transition and reward functions, we can hope to still find the optimal policy via value iteration. Let $M$ denote the true (unknown to us) model of the world (a model consists of both a transition and a reward function), and let $\hat{M}$ represent our approximate model of the world that we estimate from data.

¹ We will denote by $T(s, a, s')$ the probability of $s'$ given $(s, a)$, whereas $T(s, a)$ denotes the entire probability distribution $p(s' \mid s, a)$.

In this question, given estimates of both the transition function $T(s, a) = p(s' \mid s, a) \in \Delta^{|S|-1}$ and the reward function $R(s, a)$, we want to study the error incurred by acting optimally in this estimated MDP as opposed to acting optimally in the "true" MDP. Naturally, we should expect this error to shrink as our estimates improve, which in turn depends on how many observations we make.

(a) [BONUS] For the first part of this problem, assume that we have used our favorite learning method to obtain estimates of the transition function $\hat{T}$ and the reward function $\hat{R}$. Given that $\max_{s,a} |\hat{R}(s, a) - R(s, a)| \le \epsilon_R$ and $\max_{s,a} \|\hat{T}(s, a) - T(s, a)\|_1 \le \epsilon_P$, then for any policy $\pi : S \to A$ and $\forall s \in S$, show that [10 bonus points]

$$\left\|V^{\pi}_{\hat{M}} - V^{\pi}_{M}\right\|_\infty \le \frac{\epsilon_R}{1 - \gamma} + \frac{\gamma\,\epsilon_P}{(1 - \gamma)^2} \qquad (4)$$

Assume that the reward function is in $[0, 1]$.

Hint: 1) Expand $V_M(s)$ using the Bellman equation. 2) Hölder's inequality: $\|u \cdot v\|_1 \le \|u\|_1 \cdot \|v\|_\infty$.

(b) [BONUS] We just bounded the error of acting according to any policy $\pi : S \to A$ under the estimated model $\hat{M}$ vs. the true model $M$. Now, we use this to bound the value error of acting optimally vs. acting optimally according to the estimated model ($\pi^*_{\hat{M}}$).

Using (4), show that [3 bonus points]

$$V^*_M(s) - V^{\pi^*_{\hat{M}}}_M(s) \le 2 \sup_{\pi : S \to A} \left\|V^{\pi}_{\hat{M}} - V^{\pi}_{M}\right\|_\infty \qquad (5)$$

(c) [BONUS] Given

$$\epsilon_R = \sqrt{\frac{1}{2n} \log \frac{4|S \times A|}{\delta}} \qquad (6)$$

$$\epsilon_P = |S| \cdot \sqrt{\frac{1}{2n} \log \frac{4|S \times A \times S|}{\delta}} \qquad (7)$$

use (4) to simplify (5), i.e. $V^*_M(s) - V^{\pi^*_{\hat{M}}}_M(s)$. What is the dependence of this error on $n$, the number of observations required for each state-action pair? [2 bonus points]

(d) [BONUS] Given a dataset $D_{s,a} = \{(s, a, s'_1, r_1), \ldots, (s, a, s'_n, r_n)\}$, we estimate the unknown transition and reward functions using empirical frequencies as:

$$\hat{T}(s, a, s') = \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}(s'_i = s') \qquad (8)$$

$$\hat{R}(s, a) = \frac{1}{n} \sum_{i=1}^{n} r_i \qquad (9)$$

For notational convenience, we will use $\hat{T}(s, a)$ to denote a vector of size $|S|$ whose elements are the probabilities computed in (8) for each next state $s' \in S$. Prove that $\epsilon_R$ and $\epsilon_P$ take values as in (6) and (7). [5 bonus points]

Hint: Hoeffding's inequality.²
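A sketch of the empirical estimates in Eqs. (8)-(9) is given below; the dataset format (a list of `(s, a, s_next, r)` tuples with integer indices) is made up for illustration:

```python
import numpy as np

def estimate_model(dataset, num_states):
    """Empirical estimates T_hat(s, a, .) and R_hat(s, a) as in Eqs. (8)-(9).
    `dataset` is a list of (s, a, s_next, r) tuples with integer state/action indices."""
    counts, t_hat, r_hat = {}, {}, {}
    for s, a, s_next, r in dataset:
        if (s, a) not in counts:
            counts[(s, a)] = 0
            t_hat[(s, a)] = np.zeros(num_states)
            r_hat[(s, a)] = 0.0
        counts[(s, a)] += 1
        t_hat[(s, a)][s_next] += 1.0   # count occurrences of each next state
        r_hat[(s, a)] += r
    for key, n in counts.items():
        t_hat[key] /= n                # Eq. (8): empirical next-state distribution
        r_hat[key] /= n                # Eq. (9): empirical mean reward
    return t_hat, r_hat

# Toy usage with fabricated transitions for state 0, action 1:
data = [(0, 1, 2, 0.5), (0, 1, 2, 1.0), (0, 1, 0, 0.0)]
T_hat, R_hat = estimate_model(data, num_states=3)
print(T_hat[(0, 1)], R_hat[(0, 1)])   # [1/3, 0, 2/3] and 0.5
```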

### 4 Policy Gradients Variance Reduction [10 points]

In class (this will be covered on Oct 29th, but you don't really need to know the derivation for this question), we derived the policy gradient as

$$\nabla_\theta J(\theta) = \nabla_\theta \mathbb{E}_{\tau \sim \pi_\theta}[R(\tau)] \qquad (10)$$

$$= \mathbb{E}_{\tau \sim \pi_\theta}[R(\tau) \nabla_\theta \log \pi_\theta(\tau)] \qquad (11)$$

$$\approx \frac{1}{N} \sum_{i=1}^{N} R(\tau_i) \nabla_\theta \log \pi_\theta(\tau_i) \qquad (12)$$

where $\tau$ is a trajectory³, $R(\tau)$ is the associated reward, $N$ is the number of trials used to estimate the expected reward $J(\cdot)$, and $\theta$ denotes the parameters of the policy. These gradients are often very noisy and lead to unstable training. In this question, we show a simple method to reduce the variance in the gradients.
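To make the estimator in Eq. (12) concrete, here is a sketch on a made-up one-step problem: a Gaussian policy $\pi_\theta(a) = \mathcal{N}(\theta, 1)$ with reward $R(\tau) = -(a - 3)^2$, so each trajectory is a single action; everything in this toy setup is an assumption for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

theta, N = 0.0, 1000
actions = rng.normal(loc=theta, scale=1.0, size=N)   # tau_i ~ pi_theta
returns = -(actions - 3.0) ** 2                       # R(tau_i)
grad_logp = actions - theta                           # d/dtheta log N(a; theta, 1)

grad_estimate = np.mean(returns * grad_logp)          # Eq. (12)
print("policy gradient estimate:", grad_estimate)
```

The estimate comes out positive, nudging $\theta$ toward the reward-maximizing action at 3.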

(a) Show that adjusting the reward as $R(\tau) := R(\tau) - b$ does not change the gradient in Eq. 10. [2 points]

(b) Compute the variance of $\nabla_\theta J(\theta)$ and notice how subtracting $b$ helps reduce it. What value of $b$ leads to the least variance? [8 points]

Hint: The variance of $\nabla_\theta J(\theta)$ is written as $\mathrm{Var}(x) = \mathbb{E}[x^2] - (\mathbb{E}[x])^2$, where $x$ is substituted as $R(\tau) \nabla_\theta \log \pi_\theta(\tau)$.
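Continuing the toy Gaussian-policy example from above (all quantities made up for illustration), the effect can be seen empirically: subtracting a baseline leaves the mean of the per-sample gradient $x = (R(\tau) - b)\,\nabla_\theta \log \pi_\theta(\tau)$ essentially unchanged while shrinking its variance.

```python
import numpy as np

rng = np.random.default_rng(1)

theta, N = 0.0, 100_000
actions = rng.normal(loc=theta, scale=1.0, size=N)
returns = -(actions - 3.0) ** 2
grad_logp = actions - theta

for b in (0.0, returns.mean()):       # no baseline vs. average-return baseline
    x = (returns - b) * grad_logp
    print(f"b = {b:8.3f}   mean = {x.mean():7.3f}   variance = {x.var():10.3f}")
```

Here the average return is used as a simple baseline; part (b) asks for the value of $b$ that minimizes the variance.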

### 5 Paper Review [4 points]

In this course's final paper review section, we examine an interesting project that upends conventional research in reinforcement learning. While most RL problems are aimed at maximizing rewards, this work proposes 'upside-down reinforcement learning', where the objective is remodelled as a supervised learning problem. Limited experiments demonstrate promising and competitive results.

The paper can be accessed here.

As in the previous assignments, please limit your reviews to 350 words.

Briefly summarize the key findings, strengths and potential limitations of this work.

### 6 Coding: Dynamic Programming and Deep Q-Learning [60 points + 5 bonus points]

Follow the instructions at this link to the HW4 coding webpage.

² https://en.wikipedia.org/wiki/Hoeffding%27s_inequality#General_case_of_bounded_random_variables

³ A trajectory is defined as a sequence of states and actions, i.e. $(s_0, a_0, s_1, a_1, \ldots, s_{n-1}, a_{n-1}, s_n)$.