EECE 5698 Project #1 solved

Description

Think of an agent that plays a 2-armed bandit, trying to maximize its total reward. In each step, the
agent selects one of the levers and is given some reward according to the reward distribution of that
lever. Assume that the reward distribution for the first lever is a Gaussian with $\mu_1 = 5$, $\sigma_1^2 = 10$, and that the reward distribution for the second lever is an equal-weight mixture of two Gaussians with $\mu_{21} = 10$, $\sigma_{21}^2 = 15$ and $\mu_{22} = 4$, $\sigma_{22}^2 = 10$, meaning the reward is drawn from either of the two Gaussian distributions with equal probability (see http://en.wikipedia.org/wiki/Mixture_distribution).
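
A minimal sketch of how rewards could be sampled from these two levers, assuming Python/NumPy (the assignment does not prescribe a language, and the helper name `pull_lever` is our own):

```python
import numpy as np

# Shared random generator for all sketches in this write-up.
rng = np.random.default_rng()

def pull_lever(action):
    """Sample a reward for the chosen lever (hypothetical helper).

    Lever 0 (a1): Gaussian with mean 5, variance 10.
    Lever 1 (a2): equal-weight mixture of two Gaussians,
                  (mean 10, variance 15) and (mean 4, variance 10).
    """
    if action == 0:
        return rng.normal(5.0, np.sqrt(10.0))
    # Pick one of the two mixture components uniformly, then sample it.
    if rng.random() < 0.5:
        return rng.normal(10.0, np.sqrt(15.0))
    return rng.normal(4.0, np.sqrt(10.0))
```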
If these distributions were known (which in practice they are not), we could compute the optimal/true action values as:

$$Q^*(a_1) = E[R_1] = \mu_1 = 5$$

$$Q^*(a_2) = E[R_2] = \tfrac{1}{2}\,\mu_{21} + \tfrac{1}{2}\,\mu_{22} = \tfrac{1}{2}\times 10 + \tfrac{1}{2}\times 4 = 7$$
However, in this problem we assume the reward distributions are unknown, and the agent only sees a realization of the reward after selecting an action. The agent takes actions according to the $\epsilon$-greedy action selection policy with parameter $\epsilon$:

$$\pi_{\epsilon\text{-greedy}} = \begin{cases} a^* = \arg\max_a Q(a) & \text{with probability } 1 - \epsilon \\ \text{random action} & \text{with probability } \epsilon \end{cases}$$
The agent selects 1000 actions; each selection is referred to as a step (or time). To obtain smooth results, we repeat the 1000 steps for 100 independent runs.
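
A minimal sketch of the policy above (again assuming Python/NumPy and the `rng` defined in the earlier sketch; `epsilon_greedy` is a hypothetical helper name):

```python
import numpy as np

def epsilon_greedy(Q, epsilon):
    """Pick an action index under the epsilon-greedy policy."""
    if rng.random() < epsilon:
        return int(rng.integers(len(Q)))     # explore: uniform random action
    best = np.flatnonzero(Q == Q.max())      # all greedy (maximizing) actions
    return int(rng.choice(best))             # exploit, breaking ties randomly
```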
a) In this part, set the initial Q-values at the beginning of each run to $Q(a_1) = Q(a_2) = 0$. Assuming action $a$ is selected at time step $k$ and reward $r_k$ is observed, the Q-value for the corresponding action is updated according to $Q(a) = Q(a) + \alpha\,(r_k - Q(a))$. For the learning rate, consider the following values: $\alpha = 1$, $\alpha = 0.9^k$, $\alpha = \frac{1}{1+\ln(1+k)}$, and $\alpha = \frac{1}{k}$; for the $\epsilon$-greedy policy, use $\epsilon = 0, 0.1, 0.2, 0.5$. You need to present your results as the average accumulated reward with respect to time/step (see the example plot below); a simulation sketch follows Table 1. Here is a brief guideline:
For the $i$th independent run, you need to keep track of the accumulated reward as:

$$AccR_k^i = \frac{1}{k}\sum_{j=1}^{k} r_j$$
where $AccR_k^i$ denotes the average reward per step obtained by the agent up to time step $k$ in the $i$th independent run. The average of the accumulated rewards over the 100 independent runs, $AccR_k$, can then be obtained at any given step/time $k = 1, \dots, 1000$ as:

$$AccR_k = \frac{1}{100}\sum_{i=1}^{100} AccR_k^i$$
Therefore, in the example plot below, $AccR_k$ is on the y-axis and $k$ is on the x-axis.
You are expected to produce four plots, each associated with one learning rate and containing four curves, one per $\epsilon$ value.
For each pair of learning rate and policy parameter (i.e., each curve), you also need to report the average action values $Q(a_1)$ and $Q(a_2)$ after finishing the 1000 steps, averaged over the 100 runs. An example table for this result is shown below.
Expected Results: Four Plots and Four Tables.
Figure 1: Average accumulated reward for $\alpha = \frac{1}{1+\ln(1+k)}$ (example plot).
Table 1: Average Q-values for $\alpha = \frac{1}{1+\ln(1+k)}$

| $\epsilon$-greedy | Average of action value $Q(a_1)$ over 100 runs | True action value $Q^*(a_1)$ | Average of action value $Q(a_2)$ over 100 runs | True action value $Q^*(a_2)$ |
| --- | --- | --- | --- | --- |
| $\epsilon = 0$ (greedy) | | 5 | | 7 |
| $\epsilon = 0.1$ | | 5 | | 7 |
| $\epsilon = 0.2$ | | 5 | | 7 |
| $\epsilon = 0.5$ (random) | | 5 | | 7 |
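
A minimal simulation sketch for part (a), reusing the hypothetical `pull_lever` and `epsilon_greedy` helpers from the earlier sketches (the function name `run_experiment` and the schedule labels are our own):

```python
import numpy as np

def run_experiment(alpha_fn, epsilon, q_init=(0.0, 0.0),
                   n_steps=1000, n_runs=100):
    """Average accumulated reward per step, averaged over independent runs.

    alpha_fn maps the (1-based) step index k to the learning rate alpha.
    Returns (acc_r, q_avg): acc_r[k-1] is AccR_k averaged over the runs,
    and q_avg holds the final Q-values averaged over the runs.
    """
    acc = np.zeros(n_steps)
    q_avg = np.zeros(2)
    for _ in range(n_runs):
        Q = np.array(q_init, dtype=float)
        rewards = np.zeros(n_steps)
        for k in range(1, n_steps + 1):
            a = epsilon_greedy(Q, epsilon)     # sketch above
            r = pull_lever(a)                  # sketch above
            Q[a] += alpha_fn(k) * (r - Q[a])   # incremental Q-update
            rewards[k - 1] = r
        # AccR_k^i = (1/k) * sum_{j<=k} r_j, computed for every k at once:
        acc += np.cumsum(rewards) / np.arange(1, n_steps + 1)
        q_avg += Q
    return acc / n_runs, q_avg / n_runs

# The four learning-rate schedules and epsilon values from part (a):
schedules = {
    "alpha = 1":             lambda k: 1.0,
    "alpha = 0.9^k":         lambda k: 0.9 ** k,
    "alpha = 1/(1+ln(1+k))": lambda k: 1.0 / (1.0 + np.log(1.0 + k)),
    "alpha = 1/k":           lambda k: 1.0 / k,
}
epsilons = [0.0, 0.1, 0.2, 0.5]
```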
b) For fixed $\alpha = 0.1$ and $\epsilon = 0.1$, use the following optimistic initial values and compare the results: $Q = [0\ 0]$, $Q = [5\ 7]$, $Q = [20\ 20]$ (note that $Q = [Q(a_1)\ Q(a_2)]$). Plot the average accumulated reward with respect to step/time in a single plot with three curves, where each curve is associated with one set of initial Q-values (a brief sketch follows the table below). The average action values should be reported in the following table.
| Initial $Q$ values | Average of action value $Q(a_1)$ over 100 runs | True action value $Q^*(a_1)$ | Average of action value $Q(a_2)$ over 100 runs | True action value $Q^*(a_2)$ |
| --- | --- | --- | --- | --- |
| $Q = [0\ 0]$ | | 5 | | 7 |
| $Q = [5\ 7]$ | | 5 | | 7 |
| $Q = [20\ 20]$ | | 5 | | 7 |
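
Part (b) only changes the initial Q-vector; a brief sketch reusing the hypothetical `run_experiment` from the part (a) sketch:

```python
# Part (b): fixed alpha = 0.1 and epsilon = 0.1, varying only the
# initial Q-vector across the three settings from the assignment.
for q0 in [(0.0, 0.0), (5.0, 7.0), (20.0, 20.0)]:
    acc_r, q_avg = run_experiment(lambda k: 0.1, epsilon=0.1, q_init=q0)
    print(f"Q0 = {q0}: final average Q-values = {q_avg}")
```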
c) For fixed $\alpha = 0.1$, use the Gradient-Bandit policy with $H_1(a_1) = H_1(a_2) = 0$:

$$\pi_t(a) = \frac{\exp(H_t(a))}{\exp(H_t(a_1)) + \exp(H_t(a_2))}$$

$$H_{t+1}(a_t) = H_t(a_t) + \alpha\,(R_t - \bar{R}_t)\,(1 - \pi_t(a_t)) \quad \text{(for the action selected at time } t\text{)}$$

$$H_{t+1}(a) = H_t(a) - \alpha\,(R_t - \bar{R}_t)\,\pi_t(a) \quad \text{(for the action not chosen at time } t\text{)}$$

$$\bar{R}_t = (r_1 + \dots + r_t)/t$$

Plot the average accumulated reward with respect to step/time. How do the results differ from the $\epsilon$-greedy results with $Q(a_1) = Q(a_2) = 0$, $\alpha = 0.1$, and $\epsilon = 0.1$? You may plot both curves on top of each other for comparison.

Expected Results: One Plot.
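
A sketch of this gradient-bandit agent under the same assumptions (Python/NumPy, reusing `rng` and `pull_lever` from the earlier sketches; the name `run_gradient_bandit` is our own):

```python
import numpy as np

def run_gradient_bandit(alpha=0.1, n_steps=1000, n_runs=100):
    """Gradient-bandit agent with preferences initialized to zero."""
    acc = np.zeros(n_steps)
    for _ in range(n_runs):
        H = np.zeros(2)                       # action preferences H_t(a)
        rewards = np.zeros(n_steps)
        r_bar = 0.0                           # running average reward R_bar
        for k in range(1, n_steps + 1):
            pi = np.exp(H) / np.exp(H).sum()  # softmax over preferences
            a = int(rng.choice(2, p=pi))      # sample action from pi_t
            r = pull_lever(a)                 # earlier sketch
            rewards[k - 1] = r
            r_bar += (r - r_bar) / k          # R_bar_t = (r_1+...+r_t)/t
            # Raise the taken action's preference when r beats the baseline
            # and lower the other action's preference (and vice versa):
            H += alpha * (r - r_bar) * (np.eye(2)[a] - pi)
        acc += np.cumsum(rewards) / np.arange(1, n_steps + 1)
    return acc / n_runs
```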
Important Note: For all results, you need to interpret/explain your findings. For instance, in part (a), you need to explain which of the greedy, random, and in-between policies performed best, which learning rate was best, which pair of $\alpha$ and $\epsilon$ led to the maximum average accumulated reward, etc. For part (b), you also need to explain how optimistic initial values impact the overall performance of the selection process and which choice is the best. For part (c), you need to compare the results of the gradient-based policy with the previously computed $\epsilon$-greedy policy.
Questions about the project should be directed to the TA, Begum Taskazan, at taskazan.b@northeastern.edu.