Sale!

# CS4803/7643: Problem Set 4 solved

Original price was: \$35.00.Current price is: \$30.00.

Category:

## Description

1 Optimal Policy and Value Function [15 points]
Consider a simple 2-state MDP shown in Figure 1, S = {S1, S2}. From each state, there are two
available actions A = {stay, go}. The reward received for taking an action at a state is shown in
the figure for each (s, a) pair. Also, taking action go from state S2 ends the episode. All transitions
are deterministic.
Figure 1: 2-state MDP
(a) Consider a policy that always chooses the stay action at every state. Compute the sum of
discounted rewards obtained by this policy starting at state S1, assuming a discount of γ i.e.
compute P∞
t=0
γ
t
rt(st
, at). [4 points]
(b) What is the optimal policy? Write it as a tuple (a1, a2) of the optimal actions at (s1, s2)
respectively. Compute the sum of discounted rewards obtained by this policy, given that the
start state is S1, with discount γ. [4 points]
(c) Recall that one update of the value iteration algorithm performs the following operation,
where r(s, a) denotes the reward for taking action a at state s.
For each s ∈ S :
V
i+1(s) ← max
a
X
s
0
p(s
0
|s, a)

r(s, a) + γV i
(s
0
)

(1)
Note: Since taking action go at s2 will terminate the episode, the value of taking the go
action at s2 is just r(s2, go) instead of r(s2, go) + γV (s1) (since you don’t care about future
values after the episode has ended). Thus the update for state V (s2) will look like:
V
i+1(s2) ← max
r(s2, stay) + γV i
(s2), r(s2, go)

(2)
The value function V is stored as an array of size |S| ( |S| = 2 in our case). For example,
the array V
0 = [0, 1] denotes that V
0
(s1) = 0, V 0
(s2) = 1. Starting with a initial V
0 = [0, 0],
peform the value iteration update to compute V
1
, V 2
, V 3 assuming a discount factor γ = 1.
What is the optimal V

? [7 points]
2 Value Iteration Convergence [25 points + 5 bonus points]
In part (c) of the previous question, we computed V
i
for a few updates of value iteration and also
V

, the optimal value function. What happened to the “error” in the value estimate at every update
i.e. did |V
i
(s)−V

(s)| every increase for a particular state s? The answer is yes, the value estimate

for any s may fluctuate before eventually converging to V

(s). If this is the case, do we have any
hope that the value iteration algorithm will converge to V

? In this question, we will prove that
value iteration indeed converges due to the monotonic decrease in a different error – the maximum
error over all states max
s∈S
|V
i
(s) − V

(s)| or the L∞ norm

V
i − V

.
(a) For V
1
, V 2
, V 3 obtained in 1(c), compute

V
i − V

and verify that this error decreased
monotonically. [5 points]
(b) Let T : R
|S| → R
|S| be the function that takes the vector V
i as input and applies the value
iteration update to produce V
i+1. We will show that for any two vectors V, V 0 ∈ R
|S|, this
operator decreases the L∞ norm between them.
Prove that: kT(V ) − T(V
0
)k∞ ≤ γ kV − V
0k∞ for any V, V 0 ∈ R
|S|
, [10 points]
(c) Now consider the sequence of value functions {V
n} obtained by iteratively performing the
value iteration update as per Eq. 1. This sequence has the interesting property of being a
cauchy sequence i.e. the elements of the sequence become arbitrarily close to each other as the
sequence progresses. Formally, we say a sequence is cauchy w.r.t. the L∞ norm if for every
positive number , there is a positive integer N s.t. for all positive integers m, n greater than
N, kxm − xnk∞ < . Equipped with this fact, show that the error of V n+1 w.r.t the optimal value function can be bounded as: ∀ > 0, ∃N > 0 s.t. ∀n > N

V
n+1 − V

γ
1 − γ
, (3)
[10 points]
Hint: First try to prove that

V
n+1 − V

γ
1−γ

V
n − V
n+1

, then apply the Cauchy
sequence condition on V
n
Discussion: Given Equation 3, we now have a guarantee that value iteration converges to V

.
However, we made the assumption that {V
n} is a Cauchy sequence. As a bonus question below, we
will prove that the condition in (a) is sufficient to conclude that {V
n} is indeed a cauchy sequence.
(d) [BONUS] Given a function T : R
n → R
n
such that kT(x1) − T(x2)k∞ ≤ α kx1 − x2k∞,
α ∈ [0, 1), prove first that T has a fixed point i.e. ∃x

: T(x

) = x
∗ and second that the fixed
point of T is unique. [5 bonus points]
3 Learning the Model. [20 bonus points]
This is a challenging question with all sub-parts as bonus credit.
While value iteration (VI) allows us to find the optimal policy, performing VI updates (Eq. 1)
requires the transition function T(s, a) := p(s
0
| s, a)
1 and the reward function R(s, a). In many
practical situations, these quantities are unknown. However, we can step through the environment
and observe transitions to collect data of the form (s, a, s0
, r). Using this data, if we can obtain
an estimate of the transition and reward functions, we can hope to still find the optimal policy via
value iteration. Let M denote the true (unknown to us) model of the world (a model consists of
1we will denote T(s, a, s0
) as the probability value of s
0
give (s, a), whereas T(s, a) denotes the entire probability
distribution p(s
0
| s, a)
3
both a transition and reward function), and let Mˆ represent our approximate model of the world
that we estimate from data.
In this question, given estimates of both the transition function T(s, a) = p(s
0
| s, a) ∈ ∆|S|−1
and the reward function R(s, a), we want to study the error of acting optimally in this estimated
MDP as opposed to acting optimally in the “true” MDP. Naturally, we should expect this error to
be smaller as we get better estimates which in turn is proportional to how many observations we
make.
(a) [BONUS] For the first part of this problem, assume that we have used our favorite learning
method to obtain estimates of the transition function Tˆ and the reward function Rˆ. Given
that maxs,a |Rˆ(s, a) − R(s, a)| ≤ R and maxs,a ||Tˆ(s, a) − T(s, a)||1 ≤ P , then for any policy
π : S → A, ∀s ∈ S, show that [10 bonus points]
||V
π
Mˆ − V
π
M||∞ ≤
R
1 − γ
+
γP
(1 − γ)
2
(4)
Assume that the reward function is in [0, 1].
Hint: 1) Expand VM(s) using the Bellman equation 2) Holder’s Inequality: ||u · v||1 ≤
||u||1 · ||v||∞
(b) [BONUS] We just bounded the error of acting according to any policy π : S → A under the
estimated model, Mˆ vs. the true model, M. Now, we use this to bound the value error in
acting optimally vs. acting optimally according to the estimated model (π

).
Using (4), show that [3 bonus points]
V

M(s) − V
π

M (s) ≤ 2 sup
π:S→A
||V
π
Mˆ − V
π
M||∞ (5)
(c) [BONUS] Given
R =
r
1
2n
log 4|S × A|
δ
(6)
P = |S| · r
1
2n
log 4|S × A × S|
δ
(7)
use (4) to simplify (5) i.e. V

M(s)−V
π

M
M (s). What is the dependence on this error and number
of observations required for each state action pair, n? [2 bonus points]
(d) [BONUS] Given a dataset Ds,a = {(s, a, s0
1
, r1), . . .(s, a, s0
n
, rn)} we estimate the unknown
transition and reward function using empirical frequencies as:
Tˆ(s, a, s0
) = 1
n
Xn
i=1
1(s
0
i = s
0
) (8)
Rˆ(s, a) = 1
n
Xn
i=1
r (9)
4
For notational convenience, we will use Tˆ(s, a) to denote a vector of size |S| whose element is
the probability computed in (8) for each next state, s
0 ∈ S. Prove that R and P take values
as in (6) and (7). [5 bonus points]
Hint: Hoeffding’s inequality2
4 Policy Gradients Variance Reduction [10 points]
In class (will be covered on Oct 29th, but you don’t really need to know the derivation for this
question), we derived the policy gradient as
∇θJ(θ) = ∇θEτ∼πθ
[R(τ )] (10)
= Eτ∼πθ
[R(τ )∇ log πθ(τ )] (11)

1
N
X
N
i=1
R(τi)∇θ log πθ(τi) (12)
where τ is a trajectory3
, R(τ ), the associated reward and N, the number of trials to estimate the
expected reward, J(·), the expected reward and θ, the parameters of the policy. These gradients
are often very noisy and lead to unstable training. In this question, we show a simple method to
reduce the variance in the gradients.
(a) Show that adjusting the reward as R(τ ) := R(τ ) − b does not change the gradient estimate
in Eq. 10. [2 points]
(b) Compute the variance of ∇θJ(θ) and notice how subtracting b helped reduce it. What value
of b will lead to the least variance? [8 points]
Hint: The variance of J(θ) is written as V ar(x) = E[x
2
] − (E[x])2 where x is substituted as
R(τ )∇θ log πθ(τ )
5 Paper Review: [4 points]
In this course’s final paper review section, we examine an interesting project that upends conventional research in reinforcement learning. While most RL problems are aimed at maximizing
rewards, this work proposes ’upside down reinforcement learning’, where the objective is remodelled
as a supervised learning problem. Limited experiments demonstrate promising and competitive
results.
The paper can be accessed here.
As in the previous assignments, please limit your reviews to 350 words.
Briefly summarize the key findings, strengths and potential limitations of this work.
6 Coding: Dynamic Programming and Deep Q-Learning [60 points
+ 5 bonus points]
2
https://en.wikipedia.org/wiki/Hoeffding%27s_inequality#General_case_of_bounded_random_variables
3A trajectory is defined as a sequence of states and actions i.e. (s0, a0, s1, a1, . . . sn−1, an−1, sn)
5