Sale!

CSE 541 Homework 3 Interactive Learning Solution

Original price was: $35.00.Current price is: $30.00. $25.50

Category:

Description

5/5 - (6 votes)

Contextual Bandits
1. Problem 18.8 of [SzepesvariLattimore].
2. In this exercise we will implement several contextual bandit algorithms. We will “fake” a contextual bandit problem with multi-class classification dataset where each example is context, and the learner chooses an
“action” among the available class labels, and receives a reward of 1 if the guess was correct, and 0 otherwise.
However, keeping with bandit feedback, we assume the learner only knows the reward of the action played,
not all actions.
We will use the MNIST dataset1
. The MNIST dataset contains 28×28 images of handwritten digits from
0-9. Download this dataset and use the python-mnist library2
to load it into Python. Rather than using
the full images, you may run PCA on the data to come up with a lower dimensional representation of each
image. You will have to experiment with what dimension, d, to use. Scale all images so that they are norm
1.
Let the d dimensional representation of the tth image in the dataset, ct, be our “context.” Our action
set A = {0, 1, . . . , 9} has 10 actions associated with each label. For each i ∈ A = {0, 1, . . . , 9} define the
feature map φ(c, i) = vec(ce
>
i
) ∈ R
10d
. If v(c, a) is the expected reward of playing action a ∈ A in response
to context c, then let us “model the world” with the simple linear model so that v(c, a) ≈ hθ∗, φ(c, a)i for
some unknown θ∗ ∈ R
10d
. Of course, when actually playing the game we will observe image features ct as
the context, choose an “action” at ∈ {0, . . . , 9}, and receive reward rt = 1{at = yt} where yt is the true
label of the image ct and at is the action played.
Implement the Explore-Then-Commit algorithms, Follow-The-Leader, LinUCB, and Thompson Sampling
algorithms for this problem. You can use just the training set of T = 50000 examples. The training set is
class balanced meaning that there are 5000 examples of each digit. Important: randomly shuffle the dataset
so the probability of any particular class showing up at any given time is 1/10. The algorithms work as
follows:
• Explore-Then-Commit (“Model the world”): Fix τ ∈ [T]. For the first τ steps, select each action
a ∈ A uniformly at random. Compute θb = arg minθ

t=1(rt − hφ(ct, at), θi)
2
. For t > τ play at =
arg maxa∈Ahφ(ct, a), θbi. Choose a value of τ and justify it.
• Explore-Then-Commit (“Model the bias”): Fix τ ∈ [T]. For the first τ steps, select each action a ∈
A uniformly at random. Our goal is to identify a policy πb : C → A using the dataset {(ct, at, pt, rt)}t≤τ
such that
πb = arg max
π∈Π

t=1
rt1{π(ct) = at}
pt
= arg min
π∈Π

t=1
rt1{π(ct) 6= at}
pt
= arg min
π∈Π
X
t∈[τ]:rt=1
1{π(ct) 6= at}
where the last line uses the fact that pt = 1/10 due to uniform exploration and the definition of rt.
Note that this is just a multi-class classification problem on dataset {(ct, at)}t∈[τ]:rt=1 where one is
trying to identify a classifier πb : C → A that predicts label at from features ct. Train a 10-class linear
logistic classifier3 πb on the data up to time [τ ] and then for t > τ play at = arg maxa∈{0,…,9} πb(ct).
Choose the same value of τ as “Model the world”.
1http://yann.lecun.com/exdb/mnist/
2https://pypi.org/project/python-mnist/
3Please feel free to use an off-the-shelf method to train logistic regression such as https://scikit-learn.org/stable/auto_
examples/linear_model/plot_iris_logistic.html#sphx-glr-auto-examples-linear-model-plot-iris-logistic-py
1
• Follow-The-Leader: Fix τ ∈ [T]. For the first τ steps, select each action a ∈ A uniformly at random.
For t > τ play at = arg maxa∈Ahφ(ct, a), θbt−1i where θbt = arg minθ
Pt
s=1(rs − hφ(cs, as), θi)
2
. Choose
a value of τ and justify it.
• LinUCB Using Ridge regression with an appropriate γ > 0 (γ = 1 may be okay) construct the confidence set Ct derived in class (and in the book). At each time t ∈ [T] play at = arg maxa∈A maxθ∈Ct
hθ, φ(ct, a)i.
• Thompson Sampling Fix γ > 0 (γ = 1 may be okay). At time t ∈ [T] draw θet ∼ N (θbt−1, V −1
t−1
)
and play at = arg maxa∈Ahθet, φ(ct, a)i where θbt = arg minθ
Pt
s=1(rs − hθ, φ(cs, as)i)
2 and Vt = γI +
Pt
s=1 φ(cs, as)φ(cs, as)
>.
Implement each of these algorithms and show a plot of the regret (all algorithms on one plot) when run
on MNIST for good choices of τ, γ. Hint, for computing V
−1
t
efficiently see https://en.wikipedia.org/
wiki/Sherman%E2%80%93Morrison_formula.
2