Description
Think of an agent that plays a 2-armed bandit, trying to maximize its total reward. In each step, the
agent selects one of the levers and receives a reward drawn from the reward distribution of that
lever. Assume that the reward distribution for the first lever is a Gaussian with μ1 = 5, σ1² = 10, and for
the second lever is a bimodal Gaussian mixture with μ21 = 10, σ21² = 15, μ22 = 4, σ22² = 10, which means
that the reward is drawn from either of these two Gaussian distributions with equal probability (see
http://en.wikipedia.org/wiki/Mixture_distribution).
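As a quick illustration (a sketch, not part of the required deliverables; function and variable names are my own choice), the two reward distributions can be sampled as follows:

```python
import random

def reward_lever1(rng):
    # Lever 1: Gaussian with mean 5 and variance 10 (std = sqrt(10))
    return rng.gauss(5, 10 ** 0.5)

def reward_lever2(rng):
    # Lever 2: bimodal mixture -- pick each component with probability 1/2,
    # then sample from N(10, 15) or N(4, 10)
    if rng.random() < 0.5:
        return rng.gauss(10, 15 ** 0.5)
    return rng.gauss(4, 10 ** 0.5)

rng = random.Random(0)
samples = [reward_lever2(rng) for _ in range(100000)]
print(sum(samples) / len(samples))  # empirical mean, close to the true value 7
```

The empirical mean of many mixture samples should approach (10 + 4)/2 = 7, matching the true action value computed below.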
If these distributions were known (which in practice they are not), we could compute the optimal/true
action values as:

q*(a1) = E[R1] = μ1 = 5
q*(a2) = E[R2] = (1/2) × μ21 + (1/2) × μ22 = 7
However, in this problem, we assume the reward distributions are unknown, and the agent only
sees a realization of the reward after selecting an action. The agent takes actions according to the ε-
greedy action-selection policy with parameter ε:

π(ε-greedy):
    a* = argmax_a Q(a)           with probability 1 − ε
    a uniformly random action    with probability ε
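The ε-greedy rule can be sketched in a few lines (choosing uniformly over both arms on exploration steps; the exact tie-breaking convention is my own assumption):

```python
import random

def epsilon_greedy(q_values, epsilon, rng):
    # With probability epsilon explore: pick an arm uniformly at random.
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    # Otherwise exploit: pick the arm with the highest current Q estimate.
    return max(range(len(q_values)), key=lambda a: q_values[a])

rng = random.Random(42)
print(epsilon_greedy([5.0, 7.0], 0.0, rng))  # greedy (epsilon = 0): always arm index 1
```

With ε = 0 the policy is purely greedy; with ε = 1 every step is a uniformly random pull.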
We consider that the agent selects 1000 actions; each selection is referred to as a step (time). In order
to obtain smooth results, we repeat the 1000 steps for 100 independent runs.
a) In this part, set the initial Q-values at the beginning of each run as Q(a1) = Q(a2) = 0. Assuming
action a is selected at time step k and the reward r is observed, the Q-value for the
corresponding action will be updated according to: Q(a) = Q(a) + α( r − Q(a) ). For the
learning rate, consider the following values: α = 1, α = 0.9^k, α = 1/(1 + ln(1 + k)), and α = 1/k, and for
the ε-greedy policy, use ε = 0, 0.1, 0.2, 0.5. You need to provide your results in terms of
average accumulated reward with respect to time/step (see the following plot). Here is a brief
guideline:
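The incremental update and the four learning-rate schedules can be sketched as follows (here k is the time step, counted from 1; names are my own):

```python
import math

def alpha_const(k):    return 1.0
def alpha_decay(k):    return 0.9 ** k
def alpha_log(k):      return 1.0 / (1.0 + math.log(1.0 + k))
def alpha_harmonic(k): return 1.0 / k

def update_q(q, r, alpha):
    # Incremental update: move the estimate toward the observed reward r.
    return q + alpha * (r - q)

# With alpha = 1/k the estimate equals the sample average of rewards seen so far.
q = 0.0
rewards = [4.0, 6.0, 5.0]
for k, r in enumerate(rewards, start=1):
    q = update_q(q, r, alpha_harmonic(k))
print(q)  # 5.0, the sample mean of the three rewards
```

Note that α = 1 simply overwrites the estimate with the latest reward, while the decaying schedules weight past rewards progressively more heavily.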
For the i-th independent run, you need to keep track of the accumulated reward as:

AvgR_k^i = (1/k) Σ_{t=1}^{k} r_t

where AvgR_k^i denotes the average reward per step obtained by the agent up to time step k
in the i-th independent run. Then the average of the accumulated rewards over the 100 independent
runs, AvgR_k, can be obtained at any given step/time k = 1, …, 1000 as:

AvgR_k = (1/100) Σ_{i=1}^{100} AvgR_k^i

Therefore, in the example plot shown on the next page, AvgR_k is on the y-axis and k is on
the x-axis.
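Assuming the rewards are stored as a runs × steps table, both averages can be computed with a running sum; a minimal sketch:

```python
def average_accumulated_reward(reward_matrix):
    # reward_matrix[i][t] holds the reward at step t+1 of run i.
    # Returns AvgR_k averaged over all runs, for k = 1..num_steps.
    num_runs = len(reward_matrix)
    num_steps = len(reward_matrix[0])
    avg = [0.0] * num_steps
    for run in reward_matrix:
        cumulative = 0.0
        for t, r in enumerate(run):
            cumulative += r
            # AvgR_k^i = (cumulative reward) / k for this run; sum across runs.
            avg[t] += cumulative / (t + 1)
    return [a / num_runs for a in avg]

# Two toy runs of three steps each:
print(average_accumulated_reward([[1.0, 2.0, 3.0], [3.0, 2.0, 1.0]]))  # [2.0, 2.0, 2.0]
```

In the assignment the matrix would be 100 × 1000, and the returned list gives the y-values of one curve.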
You are expected to produce four plots, each associated with one learning rate and containing
four curves for the four different ε values.
For every pair of learning rate and policy parameter (i.e., each curve), you also need to report
the average action values Q(a1) and Q(a2) after finishing the 1000 steps, averaged over the 100
runs. An example of the table for this result is shown below.
Expected Results: Four Plots and Four Tables.
Figure 1 Average Accumulated Reward for α = 1/(1 + ln(1 + k))
Table 1 Average Q-values for α = 1/(1 + ln(1 + k))

Epsilon-greedy    | Average of action value Q(a1) over 100 runs | True action value q*(a1) | Average of action value Q(a2) over 100 runs | True action value q*(a2)
ε = 0 (greedy)    |                                             | 5                        |                                             | 7
ε = 0.1           |                                             | 5                        |                                             | 7
ε = 0.2           |                                             | 5                        |                                             | 7
ε = 0.5 (random)  |                                             | 5                        |                                             | 7
b) For fixed α = 0.1 and ε = 0.1, use the following (optimistic) initial values and compare the
results: Q = [0 0], Q = [5 7], Q = [20 20] (note that Q = [Q(a1) Q(a2)]). Plot the average
accumulated reward with respect to step/time in a single plot with three curves, where each
curve is associated with one set of initial Q-values. The average action values should be reported
in the following table.
Initial Q-values  | Average of action value Q(a1) over 100 runs | True action value q*(a1) | Average of action value Q(a2) over 100 runs | True action value q*(a2)
Q = [0 0]         |                                             | 5                        |                                             | 7
Q = [5 7]         |                                             | 5                        |                                             | 7
Q = [20 20]       |                                             | 5                        |                                             | 7
c) For a fixed α = 0.1, use the Gradient-Bandit policy with H1(a1) = H1(a2) = 0. Plot the average
accumulated reward with respect to step/time. How do the results differ from the ε-greedy
results with Q(a1) = Q(a2) = 0, α = 0.1, and ε = 0.1? You may plot both curves on
top of each other for comparison purposes.
Expected Results: One Plot.
π_t(a) = exp(H_t(a)) / ( exp(H_t(a1)) + exp(H_t(a2)) )

H_{t+1}(A_t) = H_t(A_t) + α( r_t − r̄_t )( 1 − π_t(A_t) ),   for the action selected at time t
H_{t+1}(a)  = H_t(a) − α( r_t − r̄_t ) π_t(a),               for the action not selected at time t

r̄_t = ( r_1 + ⋯ + r_t ) / t
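The preference updates above can be sketched for the two-armed case (softmax over preferences, with the running reward average r̄_t supplied as the baseline):

```python
import math

def softmax2(h):
    # pi_t(a) = exp(H_t(a)) / sum over both arms of exp(H_t(b))
    e = [math.exp(x) for x in h]
    s = sum(e)
    return [x / s for x in e]

def gradient_bandit_update(h, chosen, reward, baseline, alpha):
    # Raise the chosen action's preference when the reward beats the baseline
    # (the running average r_bar_t); lower the other action's preference.
    pi = softmax2(h)
    new_h = list(h)
    for a in range(len(h)):
        if a == chosen:
            new_h[a] += alpha * (reward - baseline) * (1 - pi[a])
        else:
            new_h[a] -= alpha * (reward - baseline) * pi[a]
    return new_h

h = gradient_bandit_update([0.0, 0.0], chosen=1, reward=10.0, baseline=6.0, alpha=0.1)
print(h)  # preference for arm 1 goes up, arm 0 goes down by the same amount
```

With equal initial preferences the policy starts uniform, and a reward above the baseline shifts probability mass toward the rewarded arm.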
Important Note: For all results, you need to interpret/explain your findings. For instance, in part (a),
you need to explain which of the greedy, random, and in-between policies performed best, which
learning rate was best, which pair of α and ε led to the maximum average accumulated reward,
etc. For part (b), you also need to explain how optimistic initial values impact the overall
performance of the selection process and which choice is the best. For part (c), you need to compare
the results of the gradient-based policy with the previously computed ε-greedy policy.
Questions about the project should be directed to TA, Begum Taskazan, at
taskazan.b@northeastern.edu.


