Description
Think of an agent that plays a 2-armed bandit, trying to maximize its total reward. In each step, the
agent selects one of the levers and receives a reward drawn from the reward distribution of that
lever. Assume that the reward distribution for the first lever is a Gaussian with μ1 = 5, σ1² = 10, and for
the second lever is a bimodal Gaussian mixture with μ21 = 10, σ21² = 15, μ22 = 4, σ22² = 10, which means
that the reward is drawn from either of these two Gaussian distributions with equal probability (see
http://en.wikipedia.org/wiki/Mixture_distribution).
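As a quick illustration (a sketch, not part of the required deliverables; function and variable names are my own choice), the two reward distributions can be sampled as follows:

```python
import random

def reward_lever1(rng):
    # Lever 1: Gaussian with mean 5 and variance 10 (std = sqrt(10))
    return rng.gauss(5, 10 ** 0.5)

def reward_lever2(rng):
    # Lever 2: bimodal mixture -- pick each component with probability 1/2,
    # then sample from N(10, 15) or N(4, 10)
    if rng.random() < 0.5:
        return rng.gauss(10, 15 ** 0.5)
    return rng.gauss(4, 10 ** 0.5)

rng = random.Random(0)
samples = [reward_lever2(rng) for _ in range(100000)]
print(sum(samples) / len(samples))  # empirical mean, close to the true value 7
```

The empirical mean of many mixture samples should approach (10 + 4)/2 = 7, matching the true action value computed below.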
If these distributions were known (which in practice they are not), we could compute the optimal/true
action values as:

q*(a1) = E[R1] = μ1 = 5
q*(a2) = E[R2] = (1/2) × μ21 + (1/2) × μ22 = 7
However, in this problem, we assume the reward distributions are unknown, and the agent only
sees a realization of the reward after selecting an action. The agent takes actions according to the ε-
greedy action-selection policy with parameter ε:

π(ε-greedy):
    a* = argmax_a Q(a)           with probability 1 − ε
    a uniformly random action    with probability ε
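The ε-greedy rule can be sketched in a few lines (choosing uniformly over both arms on exploration steps; the exact tie-breaking convention is my own assumption):

```python
import random

def epsilon_greedy(q_values, epsilon, rng):
    # With probability epsilon explore: pick an arm uniformly at random.
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    # Otherwise exploit: pick the arm with the highest current Q estimate.
    return max(range(len(q_values)), key=lambda a: q_values[a])

rng = random.Random(42)
print(epsilon_greedy([5.0, 7.0], 0.0, rng))  # greedy (epsilon = 0): always arm index 1
```

With ε = 0 the policy is purely greedy; with ε = 1 every step is a uniformly random pull.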
We consider that the agent selects 1000 actions; each selection is referred to as a step (time). In order
to obtain smooth results, we repeat the 1000 steps for 100 independent runs.
a) In this part, set the initial Q-values at the beginning of each run as Q(a1) = Q(a2) = 0. Assuming
action a is selected at time step k and the reward r is observed, the Q-value for the
corresponding action will be updated according to: Q(a) = Q(a) + α( r − Q(a) ). For the
learning rate, consider the following values: α = 1, α = 0.9^k, α = 1/(1 + ln(1 + k)), and α = 1/k, and for
the ε-greedy policy, use ε = 0, 0.1, 0.2, 0.5. You need to provide your results in terms of
average accumulated reward with respect to time/step (see the following plot). Here is a brief
guideline:
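The incremental update and the four learning-rate schedules can be sketched as follows (here k is the time step, counted from 1; names are my own):

```python
import math

def alpha_const(k):    return 1.0
def alpha_decay(k):    return 0.9 ** k
def alpha_log(k):      return 1.0 / (1.0 + math.log(1.0 + k))
def alpha_harmonic(k): return 1.0 / k

def update_q(q, r, alpha):
    # Incremental update: move the estimate toward the observed reward r.
    return q + alpha * (r - q)

# With alpha = 1/k the estimate equals the sample average of rewards seen so far.
q = 0.0
rewards = [4.0, 6.0, 5.0]
for k, r in enumerate(rewards, start=1):
    q = update_q(q, r, alpha_harmonic(k))
print(q)  # 5.0, the sample mean of the three rewards
```

Note that α = 1 simply overwrites the estimate with the latest reward, while the decaying schedules weight past rewards progressively more heavily.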
For the i-th independent run, you need to keep track of the accumulated reward as:

AvgR_k^i = (1/k) Σ_{t=1}^{k} r_t

where AvgR_k^i denotes the average reward per step obtained by the agent up to time step k
in the i-th independent run. Then the average of the accumulated rewards over the 100 independent
runs, AvgR_k, can be obtained at any given step/time k = 1, …, 1000 as:

AvgR_k = (1/100) Σ_{i=1}^{100} AvgR_k^i

Therefore, in the example plot shown on the next page, AvgR_k is on the y-axis and k is on
the x-axis.
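Assuming the rewards are stored as a runs × steps table, both averages can be computed with a running sum; a minimal sketch:

```python
def average_accumulated_reward(reward_matrix):
    # reward_matrix[i][t] holds the reward at step t+1 of run i.
    # Returns AvgR_k averaged over all runs, for k = 1..num_steps.
    num_runs = len(reward_matrix)
    num_steps = len(reward_matrix[0])
    avg = [0.0] * num_steps
    for run in reward_matrix:
        cumulative = 0.0
        for t, r in enumerate(run):
            cumulative += r
            # AvgR_k^i = (cumulative reward) / k for this run; sum across runs.
            avg[t] += cumulative / (t + 1)
    return [a / num_runs for a in avg]

# Two toy runs of three steps each:
print(average_accumulated_reward([[1.0, 2.0, 3.0], [3.0, 2.0, 1.0]]))  # [2.0, 2.0, 2.0]
```

In the assignment the matrix would be 100 × 1000, and the returned list gives the y-values of one curve.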
You are expected to produce four plots, each associated with one learning rate and containing
four curves for the four different ε values.
For every pair of learning rate and policy parameter (i.e., each curve), you also need to report
the average action values Q(a1) and Q(a2) after finishing the 1000 steps, averaged over the 100
runs. An example of the table for this result is shown below.
Expected Results: Four Plots and Four Tables.
Figure 1 Average Accumulated Reward for α = 1/(1 + ln(1 + k))
Table 1 Average Q-values for α = 1/(1 + ln(1 + k))

Epsilon-greedy    | Average of action value Q(a1) over 100 runs | True action value q*(a1) | Average of action value Q(a2) over 100 runs | True action value q*(a2)
ε = 0 (greedy)    |                                             | 5                        |                                             | 7
ε = 0.1           |                                             | 5                        |                                             | 7
ε = 0.2           |                                             | 5                        |                                             | 7
ε = 0.5 (random)  |                                             | 5                        |                                             | 7
b) For fixed α = 0.1 and ε = 0.1, use the following (optimistic) initial values and compare the
results: Q = [0 0], Q = [5 7], Q = [20 20] (note that Q = [Q(a1) Q(a2)]). Plot the average
accumulated reward with respect to step/time in a single plot with three curves, where each
curve is associated with one set of initial Q-values. The average action values should be reported
in the following table.
Initial Q-values  | Average of action value Q(a1) over 100 runs | True action value q*(a1) | Average of action value Q(a2) over 100 runs | True action value q*(a2)
Q = [0 0]         |                                             | 5                        |                                             | 7
Q = [5 7]         |                                             | 5                        |                                             | 7
Q = [20 20]       |                                             | 5                        |                                             | 7
c) For a fixed α = 0.1, use the Gradient-Bandit policy with H1(a1) = H1(a2) = 0. Plot the average
accumulated reward with respect to step/time. How do the results differ from the ε-greedy
results with Q(a1) = Q(a2) = 0, α = 0.1, and ε = 0.1? You may plot both curves on
top of each other for comparison purposes.
Expected Results: One Plot.
π_t(a) = exp(H_t(a)) / ( exp(H_t(a1)) + exp(H_t(a2)) )

H_{t+1}(A_t) = H_t(A_t) + α( r_t − r̄_t )( 1 − π_t(A_t) ),   for the action selected at time t
H_{t+1}(a)  = H_t(a) − α( r_t − r̄_t ) π_t(a),               for the action not selected at time t

r̄_t = ( r_1 + ⋯ + r_t ) / t
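The preference updates above can be sketched for the two-armed case (softmax over preferences, with the running reward average r̄_t supplied as the baseline):

```python
import math

def softmax2(h):
    # pi_t(a) = exp(H_t(a)) / sum over both arms of exp(H_t(b))
    e = [math.exp(x) for x in h]
    s = sum(e)
    return [x / s for x in e]

def gradient_bandit_update(h, chosen, reward, baseline, alpha):
    # Raise the chosen action's preference when the reward beats the baseline
    # (the running average r_bar_t); lower the other action's preference.
    pi = softmax2(h)
    new_h = list(h)
    for a in range(len(h)):
        if a == chosen:
            new_h[a] += alpha * (reward - baseline) * (1 - pi[a])
        else:
            new_h[a] -= alpha * (reward - baseline) * pi[a]
    return new_h

h = gradient_bandit_update([0.0, 0.0], chosen=1, reward=10.0, baseline=6.0, alpha=0.1)
print(h)  # preference for arm 1 goes up, arm 0 goes down by the same amount
```

With equal initial preferences the policy starts uniform, and a reward above the baseline shifts probability mass toward the rewarded arm.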
Important Note: For all results, you need to interpret/explain your findings. For instance, in part (a),
you need to explain which of the greedy, random, and in-between policies performed best, which
learning rate was best, which pair of α and ε led to the maximum average accumulated reward,
etc. For part (b), you also need to explain how optimistic initial values impact the overall
performance of the selection process and which choice is the best. For part (c), you need to compare
the results of the gradient-based policy with the previously computed ε-greedy policy.
Questions about the project should be directed to TA, Begum Taskazan, at
taskazan.b@northeastern.edu.


