CS698X Homework 2

Problem 1: Batch Bayesian Active Learning (40 marks)
For this question, you need to read the paper “Bayesian Batch Active Learning as Sparse Subset Approximation”
by Pinsler et al. (appeared at NeurIPS 2019; link at https://arxiv.org/pdf/1908.02144.pdf), and
summarize the key ideas/answer a few questions in your own words. In particular:
• Equation (4) in the paper defines the “expected complete data log posterior”. What does this quantity mean,
and how is it used to select a batch $\mathcal{D}'$ of inputs whose labels should be queried from the oracle? Briefly
describe how the sparse-approximation-based objective function accomplishes this, along with a brief
intuitive meaning of this objective function. Your answer should be in your own words (do not use the
same text from the paper), using not more than 100 words and using only the key equations. (10 marks)
• As the paper discusses, the sparse-approximation-based objective is difficult to optimize. What relaxed
objective is minimized instead in the paper? You may write the relevant equations along with a brief
explanation (this too should be in your own words; do not use the same text from the paper), using not
more than 200 words and using only the key terms and equations. (20 marks)
• For what kind of supervised learning models does the acquisition function proposed in the paper have a
closed-form expression? For other types of models, where a closed form won’t be available, what scheme
does the paper propose to obtain closed-form expressions? (10 marks)
For this question, please feel free to discuss with other students, but all parts of the answer must be in your own
words.
Problem 2: Local Conjugacy and Gibbs Sampling (15 marks)
Suppose we have $N$ scalar observations $x_1, x_2, \ldots, x_N$ drawn i.i.d. from a Gaussian $\mathcal{N}(x \mid \mu, \beta^{-1})$. The mean
$\mu$ has a Gaussian prior $\mathcal{N}(\mu \mid \mu_0, s_0)$, where $\mu_0$ is the mean and $s_0$ is the variance of this prior. The
precision $\beta$ has a gamma prior $\text{Gamma}(\beta \mid a, b)$, where $a$ and $b$ are the shape and rate parameters, respectively, of
the gamma prior. Derive the conditional posteriors for $\mu$ and $\beta$ using the idea of local conjugacy. (10 marks)
Also describe how you will use these conditional posteriors in a Gibbs sampling algorithm to approximate the
joint posterior over $\mu$ and $\beta$. Sketch all the steps of this Gibbs sampler. (5 marks)
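A Gibbs sampler for this model simply alternates between the two conditional posteriors. Below is a minimal numerical sketch; the conditional update formulas it uses are the standard Normal–Gamma conjugate results, stated here as an assumption to check against your own derivation:

```python
import numpy as np

def gibbs_gaussian(x, mu0, s0, a, b, n_iters=2000, seed=0):
    """Gibbs sampler for the mean mu and precision beta of a Gaussian,
    under the priors mu ~ N(mu0, s0) and beta ~ Gamma(a, b) (rate b).
    Uses the standard conjugate conditional updates (verify against your derivation)."""
    rng = np.random.default_rng(seed)
    N, xbar = len(x), np.mean(x)
    mu, beta = xbar, 1.0  # simple initialization
    mu_samples, beta_samples = [], []
    for _ in range(n_iters):
        # mu | beta, x: Gaussian (local conjugacy: the likelihood is Gaussian in mu)
        sN = 1.0 / (1.0 / s0 + N * beta)        # conditional variance
        mN = sN * (mu0 / s0 + beta * N * xbar)  # conditional mean
        mu = rng.normal(mN, np.sqrt(sN))
        # beta | mu, x: gamma (local conjugacy: the likelihood is gamma-shaped in beta)
        shape = a + N / 2.0
        rate = b + 0.5 * np.sum((x - mu) ** 2)
        beta = rng.gamma(shape, 1.0 / rate)     # NumPy's gamma takes scale = 1/rate
        mu_samples.append(mu)
        beta_samples.append(beta)
    return np.array(mu_samples), np.array(beta_samples)
```

On data simulated from $\mathcal{N}(2, \beta^{-1})$ with $\beta = 4$, the post-burn-in sample means of $\mu$ and $\beta$ should land near 2 and 4, respectively.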
Problem 3: EM for Sparse Modeling (25 marks)
Consider a linear regression model $y = Xw + \epsilon$, where $y = [y_1, \ldots, y_N]^\top$ is the $N \times 1$ response vector, $X$ is
the $N \times D$ feature matrix, and $\epsilon = [\epsilon_1, \ldots, \epsilon_N]^\top$ is the $N \times 1$ vector of i.i.d. Gaussian noise $\mathcal{N}(0, \sigma^2)$. Let us
assume the following prior on each entry of the weight vector $w \in \mathbb{R}^D$:
$$p(w_d \mid \sigma, \gamma_d) = \begin{cases} \mathcal{N}(0, \sigma^2 v_0), & \text{if } \gamma_d = 0 \\ \mathcal{N}(0, \sigma^2 v_1), & \text{if } \gamma_d = 1 \end{cases}$$
where $v_1 > v_0 > 0$. Further assume the priors $p(\gamma_d) = \text{Bernoulli}(\theta)$, $d = 1, \ldots, D$, $p(\theta) = \text{Beta}(a_0, b_0)$, and
$p(\sigma^2) = \text{IG}(\nu/2, \nu\lambda/2)$, where IG denotes the inverse-gamma prior in its shape-scale parameterization. Note that
the prior on $w_d$ can also be written as $p(w_d \mid \sigma, \gamma_d) = \mathcal{N}(0, \sigma^2 \kappa_{\gamma_d})$ with $\kappa_{\gamma_d} = \gamma_d v_1 + (1 - \gamma_d) v_0$.
• What is the effect of assuming the above prior on $w$ (answer not using more than 50 words)?
• Derive an EM algorithm for doing inference in this model. Your algorithm should give the (conditional)
posterior over the weight vector $w$ and point estimates (MAP) for the remaining unknowns $(\gamma, \sigma^2, \theta)$.
Problem 4: Gaussian Processes (30 marks)
Part 1: GP Posterior (10 marks)
When discussing GP regression, we saw that we can bypass the computation of the GP posterior and
directly compute the posterior predictive $p(y_* \mid y)$ for a new input $x_*$. Suppose we do care about the GP posterior
and would like to derive its expression, given training data $(X, y) = \{x_n, y_n\}_{n=1}^N$.
Assume a zero-mean GP prior $p(f) = \mathcal{GP}(0, \kappa)$ which, from the GP definition, is equivalent to $p(\mathbf{f}) = \mathcal{N}(0, K)$,
where $\mathbf{f} = [f(x_1), \ldots, f(x_N)]^\top$ is an $N \times 1$ vector and $K$ is the $N \times N$ kernel matrix with $K_{nm} = \kappa(x_n, x_m)$.
Assume a likelihood model $p(y_n \mid x_n, f) = \mathcal{N}(y_n \mid f(x_n), \sigma^2)$, where $f \sim \mathcal{GP}(0, \kappa)$. Derive the expression for
the GP posterior, i.e., $p(\mathbf{f} \mid y)$. You are free to use standard results for Gaussians.
Part 2: Visualizing GP Priors and Posteriors for Regression (20 marks)
Assume a GP prior $\mathcal{GP}(0, \kappa)$ where $\kappa$ is the squared exponential (SE) kernel
$$\kappa(x, x') = \rho^2 \exp\left(-\frac{(x - x')^2}{\ell^2}\right).$$
Note that this is the scalar-input version of the SE kernel; it can be generalized to the vector-input case by
replacing $(x - x')^2$ with $(x - x')^\top (x - x')$. Assume $\rho^2 = 1$.
Our data will consist of scalar inputs and will be generated using the model $y = \sin(x) + \epsilon$ with $\epsilon \sim \mathcal{N}(0, \sigma^2)$,
where $\sigma^2 = 0.05$. We will generate $N = 100$ uniformly spaced inputs $x_1, \ldots, x_N$ in the interval $[0, 4\pi]$ and
generate the corresponding outputs $y_1, \ldots, y_N$ from the above model.
For each of the following 5 values of $\ell$ from $[0.2, 0.5, 1, 2, 10]$, your task will be the following:
• Draw a random sample from the GP prior $p(\mathbf{f}) = \mathcal{N}(0, K)$ and plot it.
• Plot the mean of the GP posterior (from Part 1), again on the same figure but with a different color.
• On the same plot, also show the true function ($\sin(x)$) evaluated at the generated inputs (use a different
color for this too). Note that this curve will be the same in all 5 cases.
Your code should be submitted as a Python notebook.
You may use any library function to draw from a multivariate normal distribution. Also note that, when computing
the kernel matrix K, you may need to add a small positive number to the diagonal entries to make it invertible.
What differences do you see between the plots generated using $\ell = [0.2, 0.5, 1, 2, 10]$, in particular w.r.t. the shapes
of the GP prior/posterior vs. the true function?
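Since the notebook itself is the deliverable, here is only a minimal numerical sketch of the two computations involved: one draw from the SE-kernel GP prior, and the GP posterior mean at the training inputs. The posterior-mean formula $K(K + \sigma^2 I)^{-1} y$ used below is an assumption to check against your Part 1 derivation; plotting is omitted:

```python
import numpy as np

def se_kernel(x1, x2, ell, rho2=1.0):
    """SE kernel rho^2 * exp(-(x - x')^2 / ell^2) on scalar inputs."""
    d2 = (x1[:, None] - x2[None, :]) ** 2
    return rho2 * np.exp(-d2 / ell**2)

def gp_prior_sample_and_posterior_mean(ell, sigma2=0.05, N=100, seed=0):
    rng = np.random.default_rng(seed)
    x = np.linspace(0, 4 * np.pi, N)
    y = np.sin(x) + rng.normal(0, np.sqrt(sigma2), size=N)
    K = se_kernel(x, x, ell) + 1e-8 * np.eye(N)        # small jitter on the diagonal
    f_prior = rng.multivariate_normal(np.zeros(N), K)  # one draw from the GP prior
    # Posterior mean at the training inputs (assumed standard GP regression result)
    post_mean = K @ np.linalg.solve(K + sigma2 * np.eye(N), y)
    return x, y, f_prior, post_mean
```

Looping this over the five values of $\ell$ and plotting `f_prior`, `post_mean`, and `np.sin(x)` on one figure per $\ell$ reproduces the required plots.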
Problem 5: Speeding Up Gaussian Processes (30 marks)
Consider Gaussian process (GP) regression where $y_n = f(x_n) + \epsilon_n$, with $f$ modeled by $\mathcal{GP}(0, \kappa)$ (the GP
mean function is 0 and the kernel/covariance function is $\kappa$) and noise $\epsilon_n \sim \mathcal{N}(0, \sigma^2)$. For simplicity, we will assume
the noiseless setting, so $y_n = f(x_n) = f_n$. Given $N$ training inputs $(X, \mathbf{f}) = \{x_n, f_n\}_{n=1}^N$, we have seen that
the posterior predictive distribution for a new input $x_*$ is
$$p(f_* \mid x_*, X, \mathbf{f}) = \mathcal{N}\left(f_* \mid \mathbf{k}_*^\top K^{-1} \mathbf{f},\; \kappa(x_*, x_*) - \mathbf{k}_*^\top K^{-1} \mathbf{k}_*\right)$$
In the above, $K$ is the $N \times N$ kernel matrix of the training inputs and $\mathbf{k}_*$ is the $N \times 1$ vector of kernel-based similarities
of $x_*$ with each of the training inputs. As we know, the above has $O(N^3)$ cost due to the $N \times N$ matrix inversion.
Let's consider a way to reduce this cost to make GPs more scalable. To do this, suppose there is another set of
pseudo training inputs $Z = \{z_1, \ldots, z_M\}$ with $M \ll N$, along with their respective noiseless pseudo outputs
$\mathbf{t} = \{t_1, \ldots, t_M\}$ modeled by the same GP, i.e., $t_m = f(z_m)$. Note again that $(Z, \mathbf{t})$ are NOT known.
Now assume the likelihood for each training output $f_n$ to be modeled by a posterior predictive having the same
form as the GP regression's posterior predictive (given above), but with $(Z, \mathbf{t})$ acting as “pseudo” training data:
$$p(f_n \mid x_n, Z, \mathbf{t}) = \mathcal{N}\left(f_n \mid \tilde{\mathbf{k}}_n^\top \tilde{K}^{-1} \mathbf{t},\; \kappa(x_n, x_n) - \tilde{\mathbf{k}}_n^\top \tilde{K}^{-1} \tilde{\mathbf{k}}_n\right)$$
In the above, $\tilde{K}$ is the $M \times M$ kernel matrix of the pseudo inputs $Z$, and $\tilde{\mathbf{k}}_n$ is the $M \times 1$ vector of kernel-based
similarities of $x_n$ with each of the pseudo inputs $z_1, \ldots, z_M$.
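As a quick numerical illustration of why this helps: the per-point quantities in the likelihood above involve only the $M \times M$ matrix $\tilde{K}$, so the dominant inversion cost drops from $O(N^3)$ to $O(M^3)$. A minimal sketch of these formulas, with an SE kernel and made-up pseudo data chosen purely for illustration (the problem leaves $\kappa$ and $(Z, \mathbf{t})$ generic):

```python
import numpy as np

def se_kernel(x1, x2, ell=1.0):
    """SE kernel on scalar inputs; the kernel choice is an assumption for illustration."""
    return np.exp(-((x1[:, None] - x2[None, :]) ** 2) / ell**2)

def pseudo_likelihood_params(x, Z, t, jitter=1e-8):
    """Mean and variance of p(f_n | x_n, Z, t) for each training input x_n:
        mean_n = k_tilde_n^T Ktilde^{-1} t
        var_n  = kappa(x_n, x_n) - k_tilde_n^T Ktilde^{-1} k_tilde_n
    Only the M x M matrix Ktilde is factorized (O(M^3)), never an N x N one."""
    Ktilde = se_kernel(Z, Z) + jitter * np.eye(len(Z))  # M x M
    Kxz = se_kernel(x, Z)                               # N x M similarities
    A = np.linalg.solve(Ktilde, Kxz.T)                  # columns are Ktilde^{-1} k_tilde_n
    means = Kxz @ np.linalg.solve(Ktilde, t)            # N-vector of means
    vars_ = 1.0 - np.sum(Kxz.T * A, axis=0)             # kappa(x, x) = 1 for this SE kernel
    return means, np.maximum(vars_, 0.0)                # clip tiny negatives from jitter
```

Note that at a pseudo input ($x_n = z_m$) the predictive mean reproduces $t_m$ and the variance collapses to (nearly) zero, consistent with the pseudo outputs being noiseless.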
For this problem setup, your goals are the following:
1. Derive and write down the expression for the posterior predictive distribution of the output $y_*$ for a new
input $x_*$, i.e., $p(y_* \mid x_*, X, \mathbf{f}, Z)$. Note that here we are assuming that we have estimated the unknown
pseudo inputs $Z$, which this posterior predictive will be conditioned on, in addition to the actual training
inputs and outputs $(X, \mathbf{f})$. Note, however, that the pseudo outputs $\mathbf{t}$ will still need to be marginalized out to
obtain this posterior predictive, so your derivation must also do that. How does this posterior predictive for
$y_*$ compare with the usual GP's posterior predictive for $y_*$ in terms of computational cost?
2. Part (1) requires the pseudo inputs $Z$. Since we don't know them, we must estimate them from the actual
training data $(X, \mathbf{f})$. Let's use MLE-II to estimate $Z$. In particular, derive the expression for the marginal
likelihood $p(\mathbf{f} \mid X, Z)$. You do not have to show how to solve the optimization problem for MLE-II; just
write down the MLE-II objective.
Note: Feel free to use properties of Gaussians (e.g., marginals from conditionals) to avoid deriving everything
from scratch.