COMP 597: Reinforcement Learning Assignment 1 solution

Bandit algorithms [100 points]
For this assignment, you will carry out some experimentation with bandit algorithms, in order to help
you understand what we discussed in class, and to get used to the way in which we will run experiments
for other assignments as well. You should submit a notebook with your code, results and explanations.
1. [5 points] Write a small simulator for a Bernoulli bandit with k arms. The probability of success
p_i for each arm i ∈ {1, …, k} should be provided as an input. The bandit should have a function
called "sample" which takes as input the index of an action and provides a reward sample. Recall
that a Bernoulli bandit outputs either 1 or 0, drawn from a Bernoulli distribution with parameter p_i.
Test your code with 3 arms of parameters q∗ = [0.5, 0.5 − δ, 0.5 − 2δ], with δ = 0.1. Generate and
save a set of 50 samples for each action. For the test, plot one graph for each action, containing
the reward values obtained over the 50 draws, the empirical mean of the values, and the true q∗
for each arm. Each graph will have an x-axis that goes to 50, two horizontal lines (true value and
estimated value) and a set of points of value 0 and 1.
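A minimal sketch of such a simulator is given below; the class name and constructor signature are assumptions, and only the method name sample comes from the question.

    import numpy as np

    class BernoulliBandit:
        """k-armed Bernoulli bandit; probs[i] is the success probability p_i of arm i."""
        def __init__(self, probs, rng=None):
            self.probs = np.asarray(probs, dtype=float)
            self.k = len(self.probs)
            self.rng = rng if rng is not None else np.random.default_rng()

        def sample(self, action):
            # Reward is 1 with probability p_action, otherwise 0
            return int(self.rng.random() < self.probs[action])

    # Test setup from the question: q* = [0.5, 0.5 - delta, 0.5 - 2*delta] with delta = 0.1
    delta = 0.1
    bandit = BernoulliBandit([0.5, 0.5 - delta, 0.5 - 2 * delta])
    samples = np.array([[bandit.sample(a) for _ in range(50)] for a in range(3)])  # shape (3, 50)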
2. [5 points] Code the action-value update rule discussed in lecture 2 with a fixed learning
rate α in a function called update, and the incremental computation of the sample mean presented
in lecture 2 in a function called updateAvg. Using the previous data, plot for each action a graph
showing the estimated q value as a function of the number of samples, using averaging as well as
α = 0.01 and α = 0.1, together with the true value. Each graph should have three curves and a
horizontal line.
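A possible sketch of the two update rules; the signatures are assumptions, and only the names update and updateAvg come from the question.

    def update(q, reward, alpha):
        # Constant step-size rule: Q <- Q + alpha * (R - Q)
        return q + alpha * (reward - q)

    def updateAvg(q, reward, n):
        # Incremental sample average, where n counts the samples seen so far
        # (including this one): Q_n = Q_{n-1} + (R_n - Q_{n-1}) / n
        return q + (reward - q) / n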
3. [10 points] Repeat the above experiment 100 times, starting with action value estimates of 0. Each
run will still contain 50 samples for each action. Plot the same graphs as above, but where the
curves show the average and standard error over the 100 runs. Explain in 1-2 sentences what you
observe. Which of the α values is better? How do they compare to averaging? If you wanted to
optimize further, in what range of α would you look for better values?
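One way to aggregate the 100 runs is sketched below; the plotting itself (e.g. with matplotlib's errorbar or fill_between) is left out, and the helper name is an assumption.

    import numpy as np

    def mean_and_stderr(curves):
        # curves: array of shape (n_runs, n_steps), one estimate curve per run
        curves = np.asarray(curves, dtype=float)
        mean = curves.mean(axis=0)
        stderr = curves.std(axis=0, ddof=1) / np.sqrt(curves.shape[0])
        return mean, stderr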
4. [20 points] Code the ε-greedy algorithm discussed in class, with averaging updates, with ε provided
as an input. You will run 100 independent runs, each consisting of 1000 time steps. Plot the
following graphs:
(a) The reward received over time, averaged at each time step over the 100 independent runs
(with no smoothing over the time steps), and the standard error over the 100 runs
(b) The fraction of runs (out of 100) in which the first action (which truly is best) is also estimated
best based on the action values
(c) The instantaneous regret l_t (as discussed in lecture 3), averaged over the 100 runs
(d) The total regret L_t up to time step t (as discussed in lecture 3), averaged over the 100 runs
Generate this set of graphs for the following values of ε: 0, 1/8, 1/4, 1/2, 1. Explain what you
observe in the graphs and discuss the effect of ε.
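A hedged sketch of an ε-greedy agent with averaging updates, including the per-step regret bookkeeping used for graphs (c) and (d). It reuses the BernoulliBandit sketch above (any object with a sample method and a probs attribute works); initial estimates of 0 and random tie-breaking among greedy actions are assumptions.

    import numpy as np

    def run_epsilon_greedy(bandit, epsilon, steps=1000, rng=None):
        rng = rng if rng is not None else np.random.default_rng()
        Q = np.zeros(bandit.k)           # action-value estimates
        N = np.zeros(bandit.k)           # visit counts
        rewards = np.zeros(steps)
        regret = np.zeros(steps)         # instantaneous regret l_t
        best = np.max(bandit.probs)      # value of the truly best action
        for t in range(steps):
            if rng.random() < epsilon:
                a = rng.integers(bandit.k)                    # explore
            else:
                a = rng.choice(np.flatnonzero(Q == Q.max()))  # exploit, random tie-break
            r = bandit.sample(a)
            N[a] += 1
            Q[a] += (r - Q[a]) / N[a]    # averaging update; swap in a fixed alpha for question 5
            rewards[t] = r
            regret[t] = best - bandit.probs[a]
        return rewards, regret, Q

The total regret L_t for graph (d) is then just np.cumsum(regret), and the fraction in graph (b) can be tracked by recording at each step whether np.argmax(Q) equals the index of the truly best arm.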
5. [5 points] For ε = 1/4 and ε = 1/8, plot the same graphs for α = 0.1, α = 0.01, α = 0.001 and
averaging. Explain in 2 sentences what you observe.
6. [20 points] Write a function that implements the UCB algorithm discussed in lecture 2. Set c = 2.
Plot the same graphs as above for α = 0.1, α = 0.01, α = 0.001 and averaging. Explain briefly
the behavior you observe.
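A sketch of UCB action selection with c = 2 and sample-average estimates (replacing the averaging line with a constant-α update gives the other variants); pulling every arm once before applying the UCB formula is an assumption about how untried arms are handled.

    import numpy as np

    def run_ucb(bandit, c=2.0, steps=1000):
        Q = np.zeros(bandit.k)
        N = np.zeros(bandit.k)
        rewards = np.zeros(steps)
        for t in range(steps):
            if t < bandit.k:
                a = t                                    # try every arm once first
            else:
                a = int(np.argmax(Q + c * np.sqrt(np.log(t + 1) / N)))
            r = bandit.sample(a)
            N[a] += 1
            Q[a] += (r - Q[a]) / N[a]                    # or a fixed-alpha update
            rewards[t] = r
        return rewards, Q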
7. [20 points] Write a function that implements Thompson sampling, to be discussed in lecture 4.
Plot the same graphs as above for α = 0.1, α = 0.01, α = 0.001 and averaging. Explain briefly
the behavior you observe.
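A sketch of Thompson sampling for Bernoulli rewards with a Beta(1, 1) prior per arm; how to combine it with a fixed learning rate α is not specified here, so only the standard Bayesian version is shown and the α variants are left to be resolved against lecture 4.

    import numpy as np

    def run_thompson(bandit, steps=1000, rng=None):
        rng = rng if rng is not None else np.random.default_rng()
        a_post = np.ones(bandit.k)       # Beta parameters: successes + 1
        b_post = np.ones(bandit.k)       # failures + 1
        rewards = np.zeros(steps)
        for t in range(steps):
            theta = rng.beta(a_post, b_post)   # one posterior draw per arm
            a = int(np.argmax(theta))
            r = bandit.sample(a)
            a_post[a] += r
            b_post[a] += 1 - r
            rewards[t] = r
        return rewards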
8. [5 points] For each of the algorithms, pick the best hyper-parameter combination you have observed (explain how you decided what "best" means). Plot together the curves for this setting.
Comment on the relative behavior of the different algorithms.
9. [10 points] Let us now consider a non-stationary problem. Let δ = 0.1 and imagine that after 500
time steps, the parameters of actions 2 and 3 become 0.5 + δ and 0.5 + 2δ respectively. Run each
of the three algorithms with a fixed value of α = 0.1 and with the averaging value estimation. For ε
use values 1/4 and 1/8. Using these values, plot only the reward graph as above (you should have 2
lines for ε-greedy, one for UCB and one for Thompson sampling, for each learning rate setting).
Explain what you see in the graph. Based on these results, which algorithm is best suited to cope
with non-stationarity?
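One way to model the switch is sketched below, under the assumption that it is cleanest to put the non-stationarity inside the bandit itself (mutating the probability vector of the stationary bandit after 500 steps works just as well).

    import numpy as np

    class SwitchingBernoulliBandit:
        """Bernoulli bandit whose arm probabilities change after switch_step pulls."""
        def __init__(self, probs_before, probs_after, switch_step=500, rng=None):
            self.probs_before = np.asarray(probs_before, dtype=float)
            self.probs_after = np.asarray(probs_after, dtype=float)
            self.switch_step = switch_step
            self.k = len(self.probs_before)
            self.t = 0
            self.rng = rng if rng is not None else np.random.default_rng()

        @property
        def probs(self):
            return self.probs_before if self.t < self.switch_step else self.probs_after

        def sample(self, action):
            self.t += 1
            return int(self.rng.random() < self.probs[action])

    delta = 0.1
    bandit = SwitchingBernoulliBandit([0.5, 0.5 - delta, 0.5 - 2 * delta],
                                      [0.5, 0.5 + delta, 0.5 + 2 * delta])

Since this question only asks for the reward graph, the fixed "best" value used in the regret bookkeeping of the ε-greedy sketch does not need to track the switch.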