# Gradient Boosting Algorithm for Classification Problem

In the previous post, I described how the Gradient Boosting algorithm works for a regression problem.

This time, we’re going to look at how the same algorithm works for a classification problem.

The following inputs are required before applying the algorithm.

• training data
• differentiable loss function `L(y, F(x))` where `y` is the actual value and `F(x)` is the prediction value
• number of trees (`M`)

For the sake of clarity, let’s use the following example instances as our training data (`y` is the dependent variable).

``````
A  | B  | C  | y
=================
A1 | B1 | C1 | P
A2 | B2 | C2 | P
A3 | B3 | C3 | Q
``````

Let’s execute the algorithm.

## Initialize the loss function

In this demonstration, we’ll use the following as the loss function. Note that `y` is binary here: class `P` is encoded as 1 and class `Q` as 0.

``````
L(yi, F(xi)) = -yi * log(odds) + log(1 + e^log(odds))

where odds = F(xi) / (1 - F(xi)) and F(xi) is the predicted probability for an instance i
``````
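As a quick sanity check, this loss can be written directly in Python (a minimal sketch; `y` is the 0/1-encoded label, and the prediction is passed in as `log(odds)`):

```python
import math

def log_loss(y, log_odds):
    """L(y, F(x)) expressed in terms of log(odds), as defined above."""
    return -y * log_odds + math.log(1 + math.exp(log_odds))
```

For example, `log_loss(1, math.log(2))` evaluates to `ln(3) - ln(2) = ln(1.5)`.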

## Initialize model with a constant value

As the very first step, we build a simple model that just returns a constant prediction `C`. We’d like to search for a constant `C` that minimizes the sum of the loss over all training instances.

Formally, we’d like to solve for the following optimization problem.

``````
F0(x) = argmin(C) SUM(i=1 to n) L(yi, C)
``````

Since the loss function is expressed in terms of `log(odds)`, we can define `C` to be the predicted `log(odds)` rather than the predicted probability.

Let’s expand the above equation by plugging in the loss function. Note that every instance shares the same `odds` and `log(odds)` under the base model.

``````
L(y1, C) = -1 * log(odds) + log(1 + e^log(odds))
L(y2, C) = -1 * log(odds) + log(1 + e^log(odds))
L(y3, C) = -0 * log(odds) + log(1 + e^log(odds))

SUM = L(y1, C) + L(y2, C) + L(y3, C)
``````

Searching for the `C` that minimizes `SUM` may be accomplished by taking the derivative of `SUM` with respect to `log(odds)` and setting it to zero.

``````
dL(y1, C)/d(log(odds)) = -1 + (e^log(odds) / (1 + e^log(odds)))
dL(y2, C)/d(log(odds)) = -1 + (e^log(odds) / (1 + e^log(odds)))
dL(y3, C)/d(log(odds)) = -0 + (e^log(odds) / (1 + e^log(odds)))

Let p = e^log(odds) / (1 + e^log(odds))

The sum of the above derivatives is (-1 + p) + (-1 + p) + p = 3p - 2

Setting the sum to zero yields p = 2/3, and since log(odds) = ln(p / (1 - p)), we get log(odds) = ln(2)
``````

So, our base model will return `ln(2)` for the `log(odds)` of each instance.
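Sketched in Python (assuming, as before, labels encoded as 1 for P and 0 for Q), the derivation above generalizes to taking the mean of the labels and converting it to `log(odds)`:

```python
import math

def initial_log_odds(ys):
    """F0: the constant log(odds) that minimizes the summed loss.
    The minimizing probability is the mean of the 0/1 labels."""
    p = sum(ys) / len(ys)
    return math.log(p / (1 - p))
```

With our three instances (`[1, 1, 0]`), this returns `ln(2)`, matching the result above.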

## Construct the decision tree regressor

In this step, we iterate `M` times (the number of trees specified earlier) and perform the following actions.

### a) Compute the pseudo-residuals

For each instance in the training data, we compute the following.

``````
residual_i_m = - [ dL(yi, F(xi)) / dF(xi) ] where F(x) = Fm-1(x) for i = 1, ..., n
``````

Let’s take a look at what `- [ dL(yi, F(xi)) / dF(xi) ]` really means.

``````
Know that L(yi, F(xi)) = -yi * log(odds) + log(1 + e^log(odds))

And notice that F(xi) is represented by log(odds).

Taking the derivative of the loss function with respect to log(odds) yields dL(yi, F(xi)) / d(log(odds)) = -yi + (e^log(odds) / (1 + e^log(odds))).

Negating both sides yields -dL(yi, F(xi)) / d(log(odds)) = yi - (e^log(odds) / (1 + e^log(odds))).

Or to be more specific, since F(x) = Fm-1(x), then our residual becomes -dL(y, Fm-1(x)) / d(log(odds)) = yi - (e^log(odds) / (1 + e^log(odds))) where we use log(odds) for Fm-1(x).
``````

Applying this step to our training data yields the following result. In the table below, `residual_1` holds the error values for the first tree regressor.

``````
A  | B  | C  | y  | F0(x) | residual_1
============================================================================
A1 | B1 | C1 | P  | ln(2) | 1-(e^ln(2) / (1 + e^ln(2))) = 1-(2/(1+2)) = 1/3
A2 | B2 | C2 | P  | ln(2) | 1-(e^ln(2) / (1 + e^ln(2))) = 1-(2/(1+2)) = 1/3
A3 | B3 | C3 | Q  | ln(2) | 0-(e^ln(2) / (1 + e^ln(2))) = 0-(2/(1+2)) = -2/3
``````
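The residual column above can be reproduced with a short Python sketch (labels again encoded as 1 for P, 0 for Q):

```python
import math

def sigmoid(z):
    """Convert log(odds) to a probability: e^z / (1 + e^z)."""
    return 1.0 / (1.0 + math.exp(-z))

def pseudo_residuals(ys, log_odds):
    """residual_i = y_i minus the predicted probability sigmoid(F(x_i))."""
    return [y - sigmoid(f) for y, f in zip(ys, log_odds)]
```

Here `pseudo_residuals([1, 1, 0], [math.log(2)] * 3)` gives `[1/3, 1/3, -2/3]`, as in the table.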

### b) Fit a base learner to the pseudo-residuals

In this step, we train a weak model (decision tree regressor) on the training data where `residual_i_m` becomes the dependent variable.

Let’s call the model `h_m(x)`.

For instance, let’s apply `h_m(x)` to our training data from before. In this case, `h_m(x)` becomes `h_1(x)` since it’s our first weak model.

Note that the values of `h_1(x)` are dummy values.

``````
A  | B  | C  | y  | F0(x) | residual_1 | h_1(x)
================================================
A1 | B1 | C1 | P  | ln(2) |    1/3     |  0.5
A2 | B2 | C2 | P  | ln(2) |    1/3     |  0.3
A3 | B3 | C3 | Q  | ln(2) |   -2/3     |  0.3
``````

### c) Compute multiplier for the weak model

In this step, we solve the following optimization problem. We’d like to find the multiplier (`gamma`) that minimizes the sum of the loss over all instances.

``````
gamma_min = argmin(gamma) SUM(i=1 to n) L(yi, Fm-1(xi) + gamma * h_m(xi))
``````

Let’s apply this step on our training data (the one with `h_m(x)` column).

``````
Know that L(yi, F(xi)) = -yi * log(odds) + log(1 + e^log(odds))

And notice that F(xi) is represented by log(odds).

sum_loss_func = L(y1, F0(x1) + gamma * h_1(x1)) + L(y2, F0(x2) + gamma * h_1(x2)) + L(y3, F0(x3) + gamma * h_1(x3))

sum_loss_func = -y1 * (F0(x1) + gamma * h_1(x1)) + log(1 + e^(F0(x1) + gamma * h_1(x1))) \
+ -y2 * (F0(x2) + gamma * h_1(x2)) + log(1 + e^(F0(x2) + gamma * h_1(x2))) \
+ -y3 * (F0(x3) + gamma * h_1(x3)) + log(1 + e^(F0(x3) + gamma * h_1(x3)))

sum_loss_func = -1 * (ln(2) + gamma * 0.5) + log(1 + e^(ln(2) + gamma * 0.5)) \
+ -1 * (ln(2) + gamma * 0.3) + log(1 + e^(ln(2) + gamma * 0.3)) \
+ -0 * (ln(2) + gamma * 0.3) + log(1 + e^(ln(2) + gamma * 0.3))

sum_loss_func = -(ln(2) + gamma * 0.5) + log(1 + e^(ln(2) + gamma * 0.5)) \
+ -(ln(2) + gamma * 0.3) + log(1 + e^(ln(2) + gamma * 0.3)) \
+ log(1 + e^(ln(2) + gamma * 0.3))
``````

Afterwards, let’s take the first derivative of `sum_loss_func` with respect to `gamma` and set the equation to zero.

``````
d(sum_loss_func) = {-0.5 + [0.5 * (e^(ln(2) + gamma * 0.5)) / (1 + e^(ln(2) + gamma * 0.5))]} \
----------------   + {-0.3 + [0.3 * (e^(ln(2) + gamma * 0.3)) / (1 + e^(ln(2) + gamma * 0.3))]} \
d(gamma)       + {[0.3 * (e^(ln(2) + gamma * 0.3)) / (1 + e^(ln(2) + gamma * 0.3))]}
``````

Setting the above to zero means that, once the terms are combined over a common denominator, the numerator must be zero (the denominator can never be zero).

``````
d(sum_loss_func) = {-0.8 + [0.5 * (e^(ln(2) + gamma * 0.5)) / (1 + e^(ln(2) + gamma * 0.5))]} \
----------------   + {[0.3 * (e^(ln(2) + gamma * 0.3)) / (1 + e^(ln(2) + gamma * 0.3))]} \
d(gamma)       + {[0.3 * (e^(ln(2) + gamma * 0.3)) / (1 + e^(ln(2) + gamma * 0.3))]}

0 = [-0.8 * {(1 + e^(ln(2) + gamma * 0.5))} * {(1 + e^(ln(2) + gamma * 0.3))}] + {0.5 * (e^(ln(2) + gamma * 0.5))} * {(1 + e^(ln(2) + gamma * 0.3))} \
+ {0.6 * (e^(ln(2) + gamma * 0.3))} * {(1 + e^(ln(2) + gamma * 0.5))}

0 = [-0.8 * {1 + (e^ln(2) * e^(gamma * 0.5))} * {1 + (e^ln(2) * e^(gamma * 0.3))}] + {0.5 * e^ln(2) * e^(gamma * 0.5)} * {1 + (e^ln(2) * e^(gamma * 0.3))} \
+ {0.6 * e^ln(2) * e^(gamma * 0.3)} * {1 + (e^ln(2) * e^(gamma * 0.5))}

...
...
``````

Well, as you can see, finding `gamma` by taking the derivative of the sum of the loss with respect to `gamma` is hard to do in closed form.

Fortunately, we can approximate the loss function with a second-order Taylor polynomial, shown below.

``````
Let p = gamma * h_m(xi),

L(yi, Fm-1(xi) + p) ~ L(yi, Fm-1(xi)) + p * dL(yi, Fm-1(xi))/dF + 0.5 * p^2 * d^2L(yi, Fm-1(xi))/dF^2
``````
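To get a feel for how good this approximation is, here is a numeric check in Python (the values y = 1, Fm-1 = ln(2), and step p = 0.1 are arbitrary choices for illustration):

```python
import math

def loss(y, f):
    # L(y, F(x)) in terms of the log(odds) f
    return -y * f + math.log(1 + math.exp(f))

y, f, step = 1, math.log(2), 0.1

p_hat = math.exp(f) / (1 + math.exp(f))   # predicted probability
d1 = -y + p_hat                           # first derivative of the loss w.r.t. f
d2 = p_hat * (1 - p_hat)                  # second derivative of the loss w.r.t. f

exact = loss(y, f + step)
approx = loss(y, f) + step * d1 + 0.5 * step ** 2 * d2
```

For this small step, `exact` and `approx` agree to roughly four decimal places.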

Now, let’s apply the above approximation on each instance, sum them up, take the derivative with respect to `gamma`, and set the equation to zero.

``````
p = gamma * h_1(x1)
q = gamma * h_1(x2)
r = gamma * h_1(x3)

L(y1, F0(x1) + p) ~ L(y1, F0(x1)) + [p * dL(y1, F0(x1))/dF] + [0.5 * p^2 * d^2L(y1, F0(x1))/dF^2]
L(y2, F0(x2) + q) ~ L(y2, F0(x2)) + [q * dL(y2, F0(x2))/dF] + [0.5 * q^2 * d^2L(y2, F0(x2))/dF^2]
L(y3, F0(x3) + r) ~ L(y3, F0(x3)) + [r * dL(y3, F0(x3))/dF] + [0.5 * r^2 * d^2L(y3, F0(x3))/dF^2]

sum_loss_func = L(y1, F0(x1)) + [p * dL(y1, F0(x1))/dF] + [0.5 * p^2 * d^2L(y1, F0(x1))/dF^2] \
+ L(y2, F0(x2)) + [q * dL(y2, F0(x2))/dF] + [0.5 * q^2 * d^2L(y2, F0(x2))/dF^2] \
+ L(y3, F0(x3)) + [r * dL(y3, F0(x3))/dF] + [0.5 * r^2 * d^2L(y3, F0(x3))/dF^2]

d(sum_loss_func) / d(gamma) = [h_1(x1) * dL(y1, F0(x1))/dF] + [(gamma * h_1(x1)^2) * d^2L(y1, F0(x1))/dF^2] \
+ [h_1(x2) * dL(y2, F0(x2))/dF] + [(gamma * h_1(x2)^2) * d^2L(y2, F0(x2))/dF^2] \
+ [h_1(x3) * dL(y3, F0(x3))/dF] + [(gamma * h_1(x3)^2) * d^2L(y3, F0(x3))/dF^2]

d(sum_loss_func) / d(gamma) = [h_1(x1) * (-y1 + (e^log(odds) / (1 + e^log(odds))))] \
+ [h_1(x2) * (-y2 + (e^log(odds) / (1 + e^log(odds))))] \
+ [h_1(x3) * (-y3 + (e^log(odds) / (1 + e^log(odds))))] \
+ [gamma * h_1(x1)^2 * e^log(odds) / (1 + e^log(odds))^2] \
+ [gamma * h_1(x2)^2 * e^log(odds) / (1 + e^log(odds))^2] \
+ [gamma * h_1(x3)^2 * e^log(odds) / (1 + e^log(odds))^2]

For simplicity, let's denote e^log(odds) / (1 + e^log(odds)) as pi, the predicted probability of instance i. Note that the second derivative of the loss with respect to log(odds) is then pi * (1 - pi), which is what appears in the gamma terms below.

d(sum_loss_func) / d(gamma) = [h_1(x1) * (-y1 + p1)] \
+ [h_1(x2) * (-y2 + p2)] \
+ [h_1(x3) * (-y3 + p3)] \
+ [gamma * h_1(x1)^2 * (p1 * (1 - p1))] \
+ [gamma * h_1(x2)^2 * (p2 * (1 - p2))] \
+ [gamma * h_1(x3)^2 * (p3 * (1 - p3))]
``````

Let’s set the above equation to zero to get the optimal `gamma`.

``````
0 = [h_1(x1) * (-y1 + p1)] \
+ [h_1(x2) * (-y2 + p2)] \
+ [h_1(x3) * (-y3 + p3)] \
+ [gamma * h_1(x1)^2 * (p1 * (1 - p1))] \
+ [gamma * h_1(x2)^2 * (p2 * (1 - p2))] \
+ [gamma * h_1(x3)^2 * (p3 * (1 - p3))]

-[h_1(x1) * (-y1 + p1)] \
-[h_1(x2) * (-y2 + p2)] \
-[h_1(x3) * (-y3 + p3)] = [gamma * h_1(x1)^2 * (p1 * (1 - p1))] \
+ [gamma * h_1(x2)^2 * (p2 * (1 - p2))] \
+ [gamma * h_1(x3)^2 * (p3 * (1 - p3))]
``````

Since we’re building the first tree (`m` = 1), the predicted `log(odds)` is `F0(x)`, the constant value from the base model, that is, `ln(2)`.

Converting the predicted `log(odds)` to the predicted probability is achieved by the following.

``````
p = e^log(odds) / (1 + e^log(odds))
p = e^ln(2) / (1 + e^ln(2))
p = 2 / 3
``````

Since `F0(x)` is the predicted `log(odds)` from the base model for every instance, `p1`, `p2`, and `p3` all have the same value, namely 2/3.

Plugging in the value to the above equation yields the following.

``````
Recall that h_1(x1) = 0.5, h_1(x2) = 0.3 and h_1(x3) = 0.3

-[0.5 * (-1 + 2/3)] \
-[0.3 * (-1 + 2/3)] \
-[0.3 * (0 + 2/3)] = [gamma * 0.5^2 * (2/3 * (1 - 2/3))] \
+ [gamma * 0.3^2 * (2/3 * (1 - 2/3))] \
+ [gamma * 0.3^2 * (2/3 * (1 - 2/3))]

1/15 = 43/450 * gamma

gamma = 30/43
``````

Finally, we get the optimal `gamma`, which equals 30/43.
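Rearranged, the equation above gives a closed form: `gamma = SUM(h_i * (y_i - p_i)) / SUM(h_i^2 * p_i * (1 - p_i))`. A quick numeric check in Python, using the values from this example:

```python
h = [0.5, 0.3, 0.3]   # h_1(x), the dummy outputs of the first weak model
y = [1, 1, 0]         # labels: P encoded as 1, Q as 0
p = [2/3, 2/3, 2/3]   # predicted probabilities from F0(x) = ln(2)

numerator = sum(hi * (yi - pi) for hi, yi, pi in zip(h, y, p))
denominator = sum(hi ** 2 * pi * (1 - pi) for hi, pi in zip(h, p))
gamma = numerator / denominator   # comes out to 30/43
```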

### d) Update the model

In this step, we calculate the following.

``````
Fm(x) = Fm-1(x) + gamma_min * h_m(x)

where gamma_min is the gamma calculated on the previous step.
``````

Applying the above formula to our training data, we’ll calculate `F1(x) = F0(x) + gamma_min * h_1(x)` and get the following.

``````
A  | B  | C  | y  | F0(x) | residual_1 | h_1(x) |        F1(x)
=======================================================================
A1 | B1 | C1 | P  | ln(2) |    1/3     |  0.5   | ln(2) + (30/43) * 0.5
A2 | B2 | C2 | P  | ln(2) |    1/3     |  0.3   | ln(2) + (30/43) * 0.3
A3 | B3 | C3 | Q  | ln(2) |   -2/3     |  0.3   | ln(2) + (30/43) * 0.3
``````
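The `F1(x)` column, and the corresponding updated probabilities, can be sketched in Python:

```python
import math

gamma = 30 / 43
f0 = math.log(2)          # F0(x), the same for every instance
h1 = [0.5, 0.3, 0.3]      # h_1(x) outputs from before

f1 = [f0 + gamma * h for h in h1]                       # F1(x) = F0(x) + gamma * h_1(x)
probs = [math.exp(f) / (1 + math.exp(f)) for f in f1]   # convert log(odds) back to probabilities
```

For the first instance, the predicted probability rises from 2/3 to roughly 0.74.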

## Output the final model

After constructing `M` base learners on the pseudo-residuals, we finally arrive at our final model, `FM(x)`. To classify a new instance, convert its predicted `log(odds)` into a probability and apply a threshold (e.g., predict class `P` when the probability exceeds 0.5).
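Putting every step together, here is a minimal end-to-end sketch in pure Python. It is an illustration under assumptions, not a production implementation: the weak learner is a depth-1 regression stump on a single numeric feature, `gamma` uses the closed form derived above, and the toy data is made up.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_stump(xs, targets):
    """Depth-1 regression tree on one numeric feature: pick the split
    threshold that minimizes the sum of squared errors."""
    best = None
    for thr in sorted(set(xs)):
        left = [t for x, t in zip(xs, targets) if x <= thr]
        right = [t for x, t in zip(xs, targets) if x > thr]
        if not left or not right:
            continue
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((t - lm) ** 2 for t in left) + sum((t - rm) ** 2 for t in right)
        if best is None or sse < best[0]:
            best = (sse, thr, lm, rm)
    if best is None:                      # constant feature: predict the mean
        mean = sum(targets) / len(targets)
        return lambda x: mean
    _, thr, lm, rm = best
    return lambda x: lm if x <= thr else rm

def train_gbm(xs, ys, M=5):
    """Gradient boosting for binary classification, following the steps above."""
    # 1) initialize with a constant log(odds)
    p0 = sum(ys) / len(ys)
    f0 = math.log(p0 / (1 - p0))
    F = [f0] * len(xs)
    trees = []
    for _ in range(M):
        # 2a) pseudo-residuals: y_i minus the current predicted probability
        probs = [sigmoid(f) for f in F]
        residuals = [y - p for y, p in zip(ys, probs)]
        # 2b) fit a weak learner to the residuals
        h = fit_stump(xs, residuals)
        hx = [h(x) for x in xs]
        # 2c) multiplier from the second-order (Taylor) approximation
        num = sum(hi * r for hi, r in zip(hx, residuals))
        den = sum(hi ** 2 * p * (1 - p) for hi, p in zip(hx, probs))
        gamma = num / den if den else 0.0
        # 2d) update the model
        F = [f + gamma * hi for f, hi in zip(F, hx)]
        trees.append((h, gamma))
    # 3) final model F_M(x), returned here as a probability
    return lambda x: sigmoid(f0 + sum(g * h(x) for h, g in trees))

# toy usage: small feature values belong to class 1, large ones to class 0
predict_proba = train_gbm([1, 2, 3, 4, 5, 6], [1, 1, 1, 0, 0, 0], M=5)
```

After a few boosting rounds, `predict_proba` pushes the probabilities for the two groups toward 1 and 0 respectively.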