DSGA1003 Homework 6

Ensemble Methods

1 Gradient Boosting Machines

Recall the general gradient boosting algorithm$^1$, for a given loss function $\ell$ and a hypothesis space $\mathcal{F}$ of regression functions (i.e., functions mapping from the input space to $\mathbb{R}$):

1. Initialize $f_0(x) = 0$.

2. For $m = 1$ to $M$:

   (a) Compute the gradient:
   $$\mathbf{g}_m = \left( \frac{\partial}{\partial f(x_j)} \sum_{i=1}^{n} \ell\left(y_i, f(x_i)\right) \Bigg|_{f(x_i) = f_{m-1}(x_i),\ i=1,\dots,n} \right)_{j=1}^{n}$$

   (b) Fit a regression model to $-\mathbf{g}_m$:
   $$h_m = \arg\min_{h \in \mathcal{F}} \sum_{i=1}^{n} \left( (-\mathbf{g}_m)_i - h(x_i) \right)^2$$

   (c) Choose a fixed step size $\nu_m = \nu \in (0, 1]$, or take
   $$\nu_m = \arg\min_{\nu > 0} \sum_{i=1}^{n} \ell\left(y_i, f_{m-1}(x_i) + \nu h_m(x_i)\right).$$

   (d) Take the step:
   $$f_m(x) = f_{m-1}(x) + \nu_m h_m(x)$$

3. Return $f_M$.

$^1$Besides the lecture slides, you can find an accessible discussion of this approach in http://www.saedsayad.com/docs/gbm2.pdf, in one of the original references http://statweb.stanford.edu/~jhf/ftp/trebst.pdf, and in this review paper http://web.stanford.edu/~hastie/Papers/buehlmann.pdf.
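To make the recipe concrete, here is a minimal sketch of this generic loop in Python. Shallow scikit-learn regression trees stand in for the abstract hypothesis space $\mathcal{F}$, a fixed step size is used, and all names are our own choices, not part of the assignment:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost(X, y, negative_gradient, n_rounds=100, step_size=0.1):
    """Sketch of the generic loop above. `negative_gradient(y, f)` returns
    -g_m evaluated at the current scores f; shallow regression trees stand
    in for the abstract hypothesis space F, and a fixed nu is used."""
    f = np.zeros(len(y))                                       # f_0(x) = 0
    trees = []
    for _ in range(n_rounds):
        target = negative_gradient(y, f)                       # step 2(a)
        h = DecisionTreeRegressor(max_depth=3).fit(X, target)  # step 2(b)
        f += step_size * h.predict(X)                          # steps 2(c)-(d)
        trees.append(h)
    return lambda X_new: step_size * sum(h.predict(X_new) for h in trees)

# For the square loss of Problem 1 below, the negative gradient is just the
# residual, so the call would be: gradient_boost(X, y, lambda y, f: y - f)
```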

In this problem we'll derive two special cases of the general gradient boosting framework: $L^2$-Boosting and BinomialBoost.
1. Consider the regression framework, where $\mathcal{Y} = \mathbb{R}$. Suppose our loss function is given by
$$\ell(\hat{y}, y) = \frac{1}{2}(\hat{y} - y)^2,$$
and at the beginning of the $m$'th round of gradient boosting, we have the function $f_{m-1}(x)$. Show that the $h_m$ chosen as the next basis function is given by
$$h_m = \arg\min_{h \in \mathcal{F}} \sum_{i=1}^{n} \left[ (y_i - f_{m-1}(x_i)) - h(x_i) \right]^2.$$
In other words, at each stage we find the weak prediction function $h_m \in \mathcal{F}$ that is the best fit to the residuals from the previous stage. [Hint: Once you understand what's going on, this is a pretty easy problem.]

2. Now let's consider the classification framework, where $\mathcal{Y} = \{-1, 1\}$. In lecture, we noted that AdaBoost corresponds to forward stagewise additive modeling with the exponential loss, and that the exponential loss is not very robust to outliers (i.e., outliers can have a large effect on the final prediction function). Instead, let's consider the logistic loss
$$\ell(m) = \ln\left(1 + e^{-m}\right),$$
where $m = yf(x)$ is the margin. Similar to what we did in the $L^2$-Boosting question, write an expression for $h_m$ as an argmin over $\mathcal{F}$.

2 From Margins to Conditional Probabilities$^2$

Let's consider the classification setting, in which $(x_1, y_1), \dots, (x_n, y_n) \in \mathcal{X} \times \{-1, 1\}$ are sampled i.i.d. from some unknown distribution. For a prediction function $f : \mathcal{X} \to \mathbb{R}$, we define the margin on an example $(x, y)$ to be $m = yf(x)$. Since our class predictions are given by $\mathrm{sign}(f(x))$, we see that a prediction is correct iff $m(x) > 0$. We have said we can interpret the magnitude of the margin $|m(x)|$ as a measure of confidence.

However, it is not clear what the "units" of the margin are, so it is hard to interpret the magnitudes beyond saying one prediction is more or less confident than another. In this problem, we investigate how we can translate the margin into a conditional probability, which is much easier to interpret. In other words, we are looking for a mapping $m(x) \mapsto p(y = 1 \mid x)$.

In this problem we will consider margin-based losses. A loss function is a margin-based loss if it can be written in terms of the margin $m = yf(x)$. We are interested in how we can go from an empirical risk minimizer of a margin loss, $\hat{f} = \arg\min_{f \in \mathcal{F}} \sum_{i=1}^{n} \ell(y_i f(x_i))$, to a conditional probability estimator $\hat{\pi}(x) \approx p(y = 1 \mid x)$. Our approach will be to try to find a way to use the Bayes prediction function$^3$ $f^* = \arg\min_f \mathbb{E}_{x,y}\left[\ell(yf(x))\right]$ to get the true conditional probability $p(y = 1 \mid x)$, and then apply the same mapping to the empirical risk minimizer. While there is plenty that can go wrong with this "plug-in" approach (primarily, the empirical risk minimizer from a hypothesis space $\mathcal{F}$ may be a poor estimate for the Bayes prediction function), it is at least well-motivated, and it can work well in practice. And please note that we can do better than just hoping for success: if you have enough validation data, you can directly assess how well "calibrated" the predicted probabilities are. This blog post has some discussion of calibration plots: https://jmetzen.github.io/2015-04-14/calibration.html.

$^2$This problem is based on Section 7.5.3 of Schapire and Freund's book Boosting: Foundations and Algorithms.
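For instance, here is a minimal sketch of such a calibration plot using scikit-learn's calibration_curve; the synthetic probabilities below are just a stand-in for a real model's validation-set output:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.calibration import calibration_curve

rng = np.random.default_rng(0)
y_prob = rng.uniform(size=1000)                          # stand-in predicted probabilities
y_true = (rng.uniform(size=1000) < y_prob).astype(int)   # well calibrated by construction

# Bin predictions and compare empirical frequency to predicted probability.
prob_true, prob_pred = calibration_curve(y_true, y_prob, n_bins=10)
plt.plot(prob_pred, prob_true, marker="o", label="model")
plt.plot([0, 1], [0, 1], linestyle="--", label="perfectly calibrated")
plt.xlabel("mean predicted probability")
plt.ylabel("empirical frequency")
plt.legend()
plt.show()
```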

It turns out it is straightforward to find the Bayes prediction function $f^*$ for margin losses, at least in terms of the data-generating distribution: For any given $x \in \mathcal{X}$, we'll find the best possible prediction $\hat{y}$. This will be the $\hat{y}$ that minimizes
$$\mathbb{E}_y\left[\ell(y\hat{y}) \mid x\right].$$
If we can calculate this $\hat{y}$ for all $x \in \mathcal{X}$, then we will have determined $f^*(x)$. We will simply take
$$f^*(x) = \arg\min_{\hat{y}} \mathbb{E}_y\left[\ell(y\hat{y}) \mid x\right].$$
Below we'll calculate $f^*$ for several loss functions. It will be convenient to let $\pi(x) = P(y = 1 \mid x)$ in the work below.

1. Write $\mathbb{E}_y\left[\ell(yf(x)) \mid x\right]$ in terms of $\pi(x)$ and $\ell(f(x))$. [Hint: Use the fact that $y \in \{-1, 1\}$.]

2. Show that the Bayes prediction function $f^*(x)$ for the exponential loss function $\ell(y, f(x)) = e^{-yf(x)}$ is given by
$$f^*(x) = \frac{1}{2} \ln\left(\frac{\pi(x)}{1 - \pi(x)}\right)$$
and, given the Bayes prediction function $f^*$, we can recover the conditional probabilities by
$$\pi(x) = \frac{1}{1 + e^{-2f^*(x)}}.$$
You may assume that $\pi(x) \in (0, 1)$. [Hint: Differentiate the expression in the previous problem with respect to $f(x)$. To make things a little less confusing, and also to write less, you may find it useful to change variables a bit: Fix an $x \in \mathcal{X}$. Then write $p = \pi(x)$ and $\hat{y} = f(x)$. After substituting these into the expression you had for the previous problem, you'll want to find the $\hat{y}$ that minimizes the expression. Use differential calculus. Once you've done it for a single $x$, it's easy to write the solution as a function of $x$.]

3. Show that the Bayes prediction function $f^*(x)$ for the logistic loss function $\ell(y, f(x)) = \ln\left(1 + e^{-yf(x)}\right)$ is given by
$$f^*(x) = \ln\left(\frac{\pi(x)}{1 - \pi(x)}\right)$$
and the conditional probabilities are given by
$$\pi(x) = \frac{1}{1 + e^{-f^*(x)}}.$$
Again, we may assume that $\pi(x) \in (0, 1)$.

$^3$In this context, the Bayes prediction function is often referred to as the "population minimizer." In our case, "population" refers to the fact that we are minimizing with respect to the true distribution, rather than a sample. The term "population" arises from the context where we are using a sample to approximate some statistic of an entire population (e.g., a population of people or trees).
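The closed forms in the last two problems are easy to sanity-check numerically: minimize the conditional risk over a grid of candidate predictions and compare against the formulas. A small sketch (function names are our own):

```python
import numpy as np

def bayes_yhat(p, loss, grid=np.linspace(-10, 10, 200001)):
    """Grid-minimize E_y[loss(y * yhat) | x] = p*loss(grid) + (1-p)*loss(-grid)."""
    risk = p * loss(grid) + (1 - p) * loss(-grid)
    return grid[np.argmin(risk)]

exp_loss = lambda m: np.exp(-m)
logistic_loss = lambda m: np.log1p(np.exp(-m))

for p in [0.1, 0.5, 0.9]:
    print(p,
          bayes_yhat(p, exp_loss),      0.5 * np.log(p / (1 - p)),   # should match
          bayes_yhat(p, logistic_loss), np.log(p / (1 - p)))         # should match
```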
4. [Optional] Show that the Bayes prediction function $f^*(x)$ for the hinge loss function $\ell(y, f(x)) = \max(0, 1 - yf(x))$ is given by
$$f^*(x) = \mathrm{sign}\left(\pi(x) - \frac{1}{2}\right).$$
Note that it is impossible to recover $\pi(x)$ from $f^*(x)$ in this scenario. However, in practice we work with an empirical risk minimizer, from which we may still be able to recover a reasonable estimate for $\pi(x)$. An early approach to this problem is known as "Platt scaling": https://en.wikipedia.org/wiki/Platt_scaling.

3 AdaBoost Actually Works [Optional]
Introduction

Given training set $D = \{(x_1, y_1), \dots, (x_n, y_n)\}$, where the $y_i$'s are either $+1$ or $-1$, suppose we have a weak learner $G_t$ at time $t$ and we will perform $T$ rounds of AdaBoost. Initialize the observation weights uniformly by setting $W^1 = (w_1^1, \dots, w_n^1)$ with $w_i^1 = 1$ for $i = 1, 2, \dots, n$. For $t = 1, 2, \dots, T$:

1. Fit the weak learner $G_t$ at time $t$ to training set $D$ with weighting $W^t$.

2. Compute the weighted misclassification error: $\text{err}_t = \sum_{i} w_i^t \mathbb{1}(G_t(x_i) \neq y_i) / \sum_{i} w_i^t$

3. Compute the contribution coefficient for the weak learner: $\alpha_t = \log\left(\frac{1}{\text{err}_t} - 1\right)$

4. Update the weights: $w_i^{t+1} = w_i^t \exp(-\alpha_t y_i G_t(x_i))$

After $T$ steps, the cumulative contribution of the weak learners, $G(x) = \mathrm{sign}\left(\sum_{t=1}^{T} \alpha_t G_t(x)\right)$, is the final output. We will prove that with a reasonable weak learner the error of the output decreases exponentially fast with the number of iterations.
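To make the four steps concrete, here is a minimal sketch in Python. Decision stumps from scikit-learn stand in for the abstract weak learner $G_t$; that choice, and all names, are our own:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost(X, y, T):
    """Minimal sketch of the loop above; y has entries in {-1, +1}.
    Assumes each round's weighted error satisfies 0 < err_t < 1."""
    w = np.ones(len(y))                              # w_i^1 = 1
    learners, alphas = [], []
    for _ in range(T):
        G = DecisionTreeClassifier(max_depth=1)      # a decision stump
        G.fit(X, y, sample_weight=w)                 # step 1
        pred = G.predict(X)
        err = w[pred != y].sum() / w.sum()           # step 2
        alpha = np.log(1.0 / err - 1.0)              # step 3
        w = w * np.exp(-alpha * y * pred)            # step 4
        learners.append(G)
        alphas.append(alpha)
    return lambda X_new: np.sign(
        sum(a * G.predict(X_new) for a, G in zip(alphas, learners)))
```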

Exponential bound on the training loss
More precisely, we will show that the training error $L(G, D) = \frac{1}{n}\sum_{i=1}^{n} \mathbb{1}\{G(x_i) \neq y_i\} \leq \exp(-2\gamma^2 T)$, provided the error of each weak learner is less than $1/2 - \gamma$ for some $\gamma > 0$. To start, let's define two cumulative variables: the output at time $t$, $f_t = \sum_{s \leq t} \alpha_s G_s$, and $Z_t = \frac{1}{n}\sum_{i=1}^{n} \exp(-y_i f_t(x_i))$.

1. For any function $g$, show that $\mathbb{1}\{g(x) \neq y\} < \exp(-yg(x))$.

2. Use this to show $L(G, D) < Z_T$.

3. Show that $w_i^{t+1} = \exp(-y_i f_t(x_i))$.

4. Use part 3 to show $\frac{Z_{t+1}}{Z_t} = 2\sqrt{\text{err}_{t+1}(1 - \text{err}_{t+1})}$. [Hint: use the definition of the weight updates and separate the sum according to where $G_t$ is equal to $1$ and $-1$.]

5. Show that the function $g(a) = a(1 - a)$ is monotonically increasing on $[0, 1/2]$. Show that $1 - a \leq \exp(-a)$. Then use the assumption on the weak learner to show that $\frac{Z_{t+1}}{Z_t} \leq \exp(-2\gamma^2)$.

6. Conclude the proof!
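Before writing up part 5, you can sanity-check the key inequality numerically; this is our own illustration, not part of the assignment:

```python
import numpy as np

# Sanity check for part 5: if err <= 1/2 - gamma, then
# 2*sqrt(err*(1-err)) <= exp(-2*gamma**2).
for gamma in [0.05, 0.1, 0.2, 0.4]:
    errs = np.linspace(1e-6, 0.5 - gamma, 1000)
    lhs = 2 * np.sqrt(errs * (1 - errs))
    assert np.all(lhs <= np.exp(-2 * gamma**2)), gamma
print("inequality holds on all tested grids")
```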
4 AdaBoost is FSAM With Exponential Loss [Optional]
The AdaBoost score function $G(x) = \sum_{t=1}^{T} \beta_t G_t(x)$ is a linear combination (actually a conic combination) of functions. (The prediction function is, of course, the sign of the score function.) Forward stagewise additive modeling (FSAM) is another approach to fitting a function of this form.

In FSAM, we have a base hypothesis space $\mathcal{H}$ of real-valued functions $h : \mathcal{X} \to \mathbb{R}$ and a loss function $\ell(y, \hat{y})$. In FSAM, we attempt to find a linear combination of $h$'s in $\mathcal{H}$ that minimizes the empirical risk. The procedure initializes $f_0(x) = 0$, and then repeats the following steps for $t = 1, \dots, T$:

1. $(\beta_t, h_t) = \arg\min_{\beta \in \mathbb{R},\, h \in \mathcal{H}} \sum_{i=1}^{n} \ell(y_i, f_{t-1}(x_i) + \beta h(x_i))$

2. $f_t(x) = f_{t-1}(x) + \beta_t h_t(x)$

Exponential loss and AdaBoost
Consider a generic input space $\mathcal{X}$, the classification outcome space $\mathcal{Y} = \{-1, 1\}$, the exponential loss function $\ell(y, f(x)) = \exp(-yf(x))$, and an arbitrary base hypothesis space $\mathcal{H}$ consisting of $\{-1, 1\}$-valued functions. We will show that FSAM in this setting is equivalent to a version of AdaBoost (Algorithm 1) described below. To get this equivalence, we either need to assume that FSAM chooses nonnegative step sizes, i.e. $\beta_t \geq 0$, or we need to assume that $\mathcal{H}$ is symmetric, in the sense that if $h \in \mathcal{H}$, then $-h \in \mathcal{H}$ as well.

1. Write the first step of FSAM using the exponential loss function. In particular, show that the FSAM optimization problem can be written as a minimization of a weighted exponential loss of the step $\beta h$:
$$(\beta_t, h_t) = \arg\min_{\beta \in \mathbb{R},\, h \in \mathcal{H}} \frac{1}{\sum_{i=1}^{n} w_i^t} \sum_{i=1}^{n} w_i^t \exp(-y_i \beta h(x_i)),$$
where $w_i^t = \exp(-y_i f_{t-1}(x_i))$. (Note that for any $t$, if we rescale each of $w_1^t, \dots, w_n^t$ by the same constant factor, there is no effect on the arg min. Thus the first factor $\left(\sum_{i=1}^{n} w_i^t\right)^{-1}$ can be dropped. However, we keep it so we can refer to the expression as a weighted mean.)
2. Define the weighted 0/1 error of $h$ at round $t$ to be
$$\text{err}_t(h) = \frac{1}{\sum_{i=1}^{n} w_i^t} \sum_{i=1}^{n} w_i^t \mathbb{1}(y_i \neq h(x_i)).$$
(It's the weights that are specific to round $t$.) Show that the weighted exponential loss at round $t$ can be written in terms of the weighted 0/1 error. Specifically, show that for any $\beta \in \mathbb{R}$, we have
$$\frac{1}{\sum_{i=1}^{n} w_i^t} \sum_{i=1}^{n} w_i^t \exp(-\beta y_i h(x_i)) = e^{-\beta} + \left(e^{\beta} - e^{-\beta}\right) \text{err}_t(h).$$
[Hint: Use the indicators $\mathbb{1}(h(x_i) \neq y_i)$ and $\mathbb{1}(h(x_i) = y_i)$ to split the summand on the LHS into pieces. Each piece simplifies, since $y_i, h(x_i) \in \{-1, 1\}$. Then note that $\mathbb{1}(h(x_i) = y_i) = 1 - \mathbb{1}(h(x_i) \neq y_i)$.]
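This identity is easy to verify numerically before proving it; a small sketch with randomly generated weights and labels (our own illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20
w = rng.uniform(size=n)              # round-t weights (arbitrary positives)
y = rng.choice([-1, 1], size=n)      # labels
h = rng.choice([-1, 1], size=n)      # a {-1,1}-valued hypothesis evaluated at each x_i
beta = 0.7

err = np.sum(w * (y != h)) / w.sum()
lhs = np.sum(w * np.exp(-beta * y * h)) / w.sum()
rhs = np.exp(-beta) + (np.exp(beta) - np.exp(-beta)) * err
print(np.isclose(lhs, rhs))          # True
```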

3. We now would like to show that for any fixed "step size" $\beta$, the optimal "step direction" $h$, for which $\beta h$ minimizes the weighted exponential loss, can be found by minimizing the weighted 0/1 error of $h$. More precisely, show that if $\beta \geq 0$ then
$$\arg\min_{h \in \mathcal{H}} \frac{1}{\sum_{i=1}^{n} w_i^t} \sum_{i=1}^{n} w_i^t \exp(-\beta y_i h(x_i)) = \arg\min_{h \in \mathcal{H}} \text{err}_t(h).$$
Also show that if $\beta < 0$ then
$$\arg\min_{h \in \mathcal{H}} \frac{1}{\sum_{i=1}^{n} w_i^t} \sum_{i=1}^{n} w_i^t \exp(-\beta y_i h(x_i)) = \arg\min_{h \in \mathcal{H}} \text{err}_t(-h).$$

4. Show that if $\mathcal{H}$ is symmetric, in the sense that $h \in \mathcal{H}$ implies $-h \in \mathcal{H}$, then there is always an optimal FSAM step $(\beta_t, h_t)$ with $\beta_t \geq 0$. Thus if we assume that either $\mathcal{H}$ is symmetric or FSAM chooses nonnegative step sizes, then we can conclude that
$$h_t = \arg\min_{h \in \mathcal{H}} \text{err}_t(h)$$
is a solution for $h_t$ in the minimization problem in the first part, and thus is the FSAM step direction in round $t$.

5. Now that we've found $h_t$, show that the corresponding optimal step size is given by
$$\beta_t = \frac{1}{2} \log\left(\frac{1 - \text{err}_t}{\text{err}_t}\right),$$
where we let $\text{err}_t = \text{err}_t(h_t)$ as a shorthand. [Hint: You'll need to use some differential calculus. Show that what you've found is a minimum by showing that the function you're differentiating is convex.]
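You can check the closed-form step size against a numerical minimizer of the expression from part 2; this is our own illustration, using scipy:

```python
import numpy as np
from scipy.optimize import minimize_scalar

for err in [0.1, 0.25, 0.4]:
    # Weighted exponential loss as a function of beta, from part 2.
    g = lambda b: np.exp(-b) + (np.exp(b) - np.exp(-b)) * err
    numeric = minimize_scalar(g).x
    closed = 0.5 * np.log((1 - err) / err)
    print(err, numeric, closed)      # numeric should approximately equal closed
```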

6. Show that
$$w_i^{t+1} = \begin{cases} e^{-\beta_t} w_i^t & \text{if } y_i = h_t(x_i) \\ e^{-\beta_t} w_i^t e^{2\beta_t} & \text{otherwise.} \end{cases}$$
[Hint: First show that $w_i^{t+1} = w_i^t \exp(-\beta_t y_i h_t(x_i))$. Then write $y_i h_t(x_i)$ in terms of the indicator function $\mathbb{1}(y_i \neq h_t(x_i))$.]
7. Let's introduce a specific instance of AdaBoost we'll call "Exact AdaBoost", given in Algorithm 1. The only difference between Exact AdaBoost and AdaBoost is that in Exact AdaBoost, we require that the base classifier return the best possible $h \in \mathcal{H}$, while in AdaBoost we only vaguely stated that the "base learner fits the weighted training data", with no requirement that the result be the best possible. Indeed, since a typical base classifier is a decision tree, and it's computationally prohibitive to find the best possible tree, Exact AdaBoost is not usually an implementable algorithm. Show that the score functions returned by Exact AdaBoost and by FSAM (in our setting) differ only by a constant factor, and of course the hard classifications will be exactly the same.

Algorithm 1: Exact AdaBoost
  input: Training set $D = ((x_1, y_1), \dots, (x_n, y_n)) \in \mathcal{X} \times \{-1, 1\}$
  $w_i^1 = 1$ for $i = 1, \dots, n$  # Initialize weights
  for $t = 1, \dots, T$:
    $h_t = \arg\min_{h \in \mathcal{H}} \sum_{i=1}^{n} w_i^t \mathbb{1}(y_i \neq h(x_i))$
    $\text{err}_t = \text{err}_t(h_t) = \frac{1}{\sum_{i=1}^{n} w_i^t} \sum_{i=1}^{n} w_i^t \mathbb{1}(y_i \neq h_t(x_i))$
    $\alpha_t = \ln\left(\frac{1 - \text{err}_t}{\text{err}_t}\right)$
    $w_i^{t+1} = \begin{cases} w_i^t & \text{if } y_i = h_t(x_i) \\ w_i^t e^{\alpha_t} & \text{otherwise} \end{cases}$ for $i = 1, \dots, n$
  return $f = \sum_{t=1}^{T} \alpha_t h_t$  # Returns the score function. (Predictions are $x \mapsto \mathrm{sign}(f(x))$.)
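Algorithm 1 is easy to render in code when $\mathcal{H}$ is small enough to enumerate, so the exact argmin is feasible; here is a sketch under that assumption (names are our own):

```python
import numpy as np

def exact_adaboost(X, y, H, T):
    """Sketch of Algorithm 1 for a small, explicitly enumerated hypothesis
    class H (a Python list of {-1,+1}-valued functions of X). Assumes the
    weighted error satisfies 0 < err_t < 1 each round."""
    w = np.ones(len(y))                                  # w_i^1 = 1
    steps = []
    for _ in range(T):
        weighted_errs = [np.sum(w * (h(X) != y)) for h in H]
        h_t = H[int(np.argmin(weighted_errs))]           # exact minimizer over H
        err = np.sum(w * (h_t(X) != y)) / w.sum()
        alpha = np.log((1 - err) / err)
        w = np.where(h_t(X) != y, w * np.exp(alpha), w)  # reweight mistakes only
        steps.append((alpha, h_t))
    return lambda X_new: sum(a * h(X_new) for a, h in steps)   # score function
```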

8. Suppose our ultimate goal is to find the score function returned by FSAM after T rounds
in the context described above. Suppose we only have access to an implementation of Exact
AdaBoost described in Algorithm 1, and it returns the score function f(x). What would be
the score function returned by FSAM?

5 [Optional] Decision Tree Implementation

In this problem we'll implement decision trees for both classification and regression. The strategy will be to implement a generic class, called Decision_Tree, which we'll supply with the loss function we want to use to make node splitting decisions, as well as the estimator we'll use to come up with the prediction associated with each leaf node. For classification, this prediction could be a vector of probabilities, but for simplicity we'll just consider hard classifications here. We'll work with the classification and regression data sets from Homework #4.

1. [Optional] Complete the class Decision_Tree, given in the skeleton code. The intended implementation is as follows: Each object of type Decision_Tree represents a single node of the tree. The depth of that node is represented by the variable self.depth, with the root node having depth 0. The main job of the fit function is to decide, given the data provided, how to split the node or whether it should remain a leaf node. If the node will split, then the splitting feature and splitting value are recorded, and the left and right subtrees are fit on the relevant portions of the data. Thus tree-building is a recursive procedure. We should have as many Decision_Tree objects as there are nodes in the tree. We will not implement pruning here. Some additional details are given in the skeleton code. (A rough sketch of this recursive structure appears below.)
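Since the actual interface lives in the skeleton code, the following is only a rough sketch of the recursive idea, with our own names and simplifications: no pruning, exhaustive threshold search, and a `loss` that returns a total, count-weighted impurity (e.g., sum of squared deviations):

```python
import numpy as np

class DecisionTreeSketch:
    """Rough sketch of the recursive node structure described above; the
    real skeleton code's interface differs. `loss` maps targets to a total
    impurity; `leaf_value` maps targets to a leaf prediction."""
    def __init__(self, loss, leaf_value, max_depth=5, min_samples=2, depth=0):
        self.loss, self.leaf_value = loss, leaf_value
        self.max_depth, self.min_samples, self.depth = max_depth, min_samples, depth
        self.is_leaf = True

    def fit(self, X, y):
        self.prediction = self.leaf_value(y)
        if self.depth >= self.max_depth or len(y) < self.min_samples:
            return self                                  # remain a leaf
        best = (self.loss(y), None, None)                # (loss, feature, threshold)
        for j in range(X.shape[1]):                      # exhaustive split search
            for thr in np.unique(X[:, j])[1:]:
                mask = X[:, j] < thr
                split_loss = self.loss(y[mask]) + self.loss(y[~mask])
                if split_loss < best[0]:
                    best = (split_loss, j, thr)
        if best[1] is None:
            return self                                  # no split improves the loss
        self.is_leaf, (_, self.feature, self.threshold) = False, best
        mask = X[:, self.feature] < self.threshold
        kwargs = dict(loss=self.loss, leaf_value=self.leaf_value,
                      max_depth=self.max_depth, min_samples=self.min_samples,
                      depth=self.depth + 1)
        self.left = DecisionTreeSketch(**kwargs).fit(X[mask], y[mask])
        self.right = DecisionTreeSketch(**kwargs).fit(X[~mask], y[~mask])
        return self

    def predict_one(self, x):
        if self.is_leaf:
            return self.prediction
        child = self.left if x[self.feature] < self.threshold else self.right
        return child.predict_one(x)
```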

2. [Optional] Complete either the compute_entropy or compute_gini functions. Run the code provided that builds trees for the two-dimensional classification data. Include the results. For debugging, you may want to compare results with sklearn's decision tree. For visualization, you'll need to install graphviz. (One possible shape for these impurity functions is sketched below.)
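One plausible implementation of the two impurity functions; the skeleton's exact signatures may differ, and when comparing splits you would weight each child's impurity by its size:

```python
import numpy as np

def compute_entropy(labels):
    """Entropy of a label array (base-2 here; the base is a convention)."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def compute_gini(labels):
    """Gini impurity of a label array."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)
```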

3. [Optional] Complete the function mean_absolute_deviation_around_median (MAE). Use the code provided to fit the Regression_Tree to the krr dataset using both the MAE loss and median predictions. Include the plots for the 6 fits. (A one-line sketch of the MAE criterion follows.)
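A one-line sketch of the criterion, assuming the function receives the target values of a node; the skeleton's exact signature and scaling may differ:

```python
import numpy as np

def mean_absolute_deviation_around_median(y):
    """Mean |y_i - median(y)|: the node impurity that pairs with median leaf predictions."""
    return np.mean(np.abs(y - np.median(y)))
```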

6 Gradient Boosting Implementation

This method goes by many names, including gradient boosting machines (GBM), generalized boosting models (GBM), AnyBoost, and gradient boosted regression trees (GBRT), among others. Although one of the nice aspects of gradient boosting is that it can be applied to any problem with a
subdifferentiable loss function, here we’ll keep things simple and consider the standard regression
setting with square loss.

1. Complete the gradient_boosting class. As the base regression algorithm, you should use the regression tree from the previous problem, if you completed it. Otherwise, you may use sklearn's regression tree. You should use the square loss for the tree splitting rule and the mean function for the leaf prediction rule. Run the code provided to build gradient boosting models on the classification and regression data sets, and include the plots generated. Note that we are using square loss to fit the classification data, as well as the regression data. (A sketch of the square-loss case follows below.)
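Here is a minimal sketch of the square-loss case; it uses sklearn's regression tree as the base learner, which the problem permits, though the class name and defaults below are our own:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

class GradientBoostingSketch:
    """Gradient boosting with square loss: each round fits a regression
    tree to the current residuals (the negative gradient) and takes a
    shrunken step. Names and defaults are our own choices."""
    def __init__(self, n_rounds=100, step_size=0.1, max_depth=3):
        self.n_rounds, self.step_size, self.max_depth = n_rounds, step_size, max_depth

    def fit(self, X, y):
        self.trees = []
        f = np.zeros(len(y))                              # f_0(x) = 0
        for _ in range(self.n_rounds):
            residuals = y - f                             # -gradient of (1/2)(f - y)^2
            tree = DecisionTreeRegressor(max_depth=self.max_depth).fit(X, residuals)
            f += self.step_size * tree.predict(X)         # f_m = f_{m-1} + nu * h_m
            self.trees.append(tree)
        return self

    def predict(self, X):
        return self.step_size * sum(t.predict(X) for t in self.trees)
```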

2. [Optional] Repeat the previous runs on the classification data set, but use a different classification loss, such as logistic loss or hinge loss. Include the new code and plots of your
results. Note that you should still use the same regression tree settings for the base regression
algorithm.