# Gradient Boosting Algorithm for Regression Problem

Published:

In this post, we’re going to look at how Gradient Boosting algorithm works in a regression problem.

Gradient Boosting is one of the boosting types along with Adaptive Boosting and Extreme Gradient Boosting (XGBoost).

The followings are the inputs required before applying the algorithm.

• training data
• differentiable loss function `L(y, F(x))` where `y` is the actual value and `F(x)` is the prediction value
• number of trees (`M`)

For the sake of clarity, let’s use the following example instances as our training data (`y` is the dependent variable).

``````A  | B  | C  | y
=================
A1 | B1 | C1 | 50
A2 | B2 | C2 | 70
A3 | B3 | C3 | 30
``````

Let’s execute the algorithm.

## Initialize model with a constant value

As the very first step, we build a simple model that just returns a constant value `C`. This `C` is the prediction value where the loss function is minimum.

Formally, we’d like to solve for the following optimization problem.

``````F0(x) = argmin(C) SUM(i=1 to n) L(yi, C)
``````

Let’s perform this step on our training data.

Know that the loss function for an instance is defined as `L(yi, F(xi)) = 0.5 * (yi - F(xi))^2`. Or in other words, the sum of our loss function becomes the following.

``````sum_loss_func = L(y1, F(x1)) + L(y2, F(x2)) + L(y3, F(x3))
sum_loss_func = 0.5 * (y1 - F(x1))^2 + 0.5 * (y2 - F(x2))^2 + 0.5 * (y3 - F(x3))^2
``````

Since the prediction values are the same over the train data, we can replace `F(x1), F(x2), F(x3)` to a constant of `C`. Plugging in the actual value to the above equation becomes the following.

``````sum_loss_func = 0.5 * (50 - C)^2 + 0.5 * (70 - C)^2 + 0.5 * (30 - C)^2
``````

And solve for the optimization problem can be performed by taking the first derivative of `sum_loss_func` with respect to `C` and assign the equation to zero, such as the following.

``````sum_loss_func = 0.5 * (50 - C)^2 + 0.5 * (70 - C)^2 + 0.5 * (30 - C)^2

d(sum_loss_function) = -(50 - C) - (70 - C) - (30 - C) = -150 + 3C
--------------------
dC

Assigning the result to zero yields 200 = 3C where C = 150 / 3 = 50.
``````

Our initial model is constructed so that it always return 50 for every data point.

## Construct the decision tree regressor

In this step, we iterate `M` times (number of decision tree specified before) and perform the following actions.

### a) Compute so-called pseudo-residuals

For each instance in the training data, we compute the following.

``````residual_i_m = - [ dL(yi, F(xi)) / dF(xi) ] where F(x) = Fm-1(x) for i = 1, ..., n
``````

If we take a look at the above `residual_i_m`, it’s simply the difference between the actual and the predicted value. Here’s how we’d derive it.

``````Know that L(y, F(x)) = 0.5 * (y - F(x))^2.

Taking the derivative of the loss function with respect to F(x) yields dL(y, F(x)) / dF(x) = -(y - F(x)).

Moving the minus sign to the LHS yields -dL(y, F(x)) / dF(x) = y - F(x).

Or to be more specific, we'll finally have -dL(y, Fm-1(x)) / dFm-1(x) = y - Fm-1(x).
``````

The last line of the above snippet concludes that `residual_i_m` is the same as `yi - Fm-1(xi)` for an instance `i`.

Applying this step to our train data will yield the following result. In the below table, `residual_1` means the error values for the first tree regressor.

``````A  | B  | C  | y  | F0(x) | residual_1
========================================
A1 | B1 | C1 | 50 | 50    |    0
A2 | B2 | C2 | 70 | 50    |    20
A3 | B3 | C3 | 30 | 50    |    -20
``````

### b) Fit a base learner to pseudo-residuals

In this step, we train a weak model (decision tree regressor) on the training data where `residual_i_m` becomes the dependent variable.

Let’s call the model `h_m(x)`.

For instance, let’s apply our `h_m(x)` to our previously training data. In this case, `h_m(x)` becomes `h_1(x)` since it’s our first weak model.

Note that the values of `h_1(x)` are dummy values.

``````A  | B  | C  | y  | F0(x) | residual_1 | h_1(x)
===============================================
A1 | B1 | C1 | 50 | 50    |    0       |  0.5
A2 | B2 | C2 | 70 | 50    |    20      |  18
A3 | B3 | C3 | 30 | 50    |    -20     |  -18
``````

### c) Compute multiplier for the weak model

In this step, we solve the following optimization problem. We’d like to find the multiplier (`gamma`) that minimizes the sum of loss functions.

``````gamma_min = argmin(gamma) SUM(i=1 to n) L(yi, Fm-1(xi) + gamma * h_m(xi))
``````

Let’s apply this step on our training data (the one with `h_m(x)` column).

``````sum_loss_func = L(y1, F0(x1) + gamma * h_1(x1)) + L(y2, F0(x2) + gamma * h_1(x2)) + L(y3, F0(x3) + gamma * h_1(x3))

sum_loss_func = 0.5 * (y1 - (F0(x1) + gamma * h_1(x1)))^2 \
+ 0.5 * (y2 - (F0(x2) + gamma * h_1(x2)))^2 \
+ 0.5 * (y3 - (F0(x3) + gamma * h_1(x3)))^2

sum_loss_func = 0.5 * (50 - (50 + gamma*0.5))^2 \
+ 0.5 * (70 - (50 + gamma*18))^2 \
+ 0.5 * (30 - (50 + gamma*(-18)))^2

sum_loss_func = 0.5 * (-gamma*0.5))^2 \
+ 0.5 * (20 + gamma*18)^2 \
+ 0.5 * (-20 + gamma*(-18))^2
``````

Afterwards, let’s take the first derivative of `sum_loss_func` with respect to `gamma` and set the equation to zero.

``````d(sum_loss_func) = (-gamma*0.5))*(-0.5) + (20 + gamma*18)*(18) + (-20 + gamma*(-18))*(-18)
----------------
d(gamma)

d(sum_loss_func) = gamma*0.25 + 360 + gamma*324 + 360 + gamma*(324)
----------------
d(gamma)

d(sum_loss_func) = gamma*648.25 + 720
----------------
d(gamma)

Setting the above to zero yields -720 = 648.25*gamma, therefore gamma = -1.11
``````

From the above calculation, we get `gamma_min` equals to `-1.11`.

### d) Update the model

In this step, we calculate the following.

``````Fm(x) = Fm-1(x) + gamma_min * h_m(x)
``````

Applying the above formula to our training data, we’ll calculate `F1(x) = F0(x) + gamma_min * h_1(x)` and get the following.

``````A  | B  | C  | y  | F0(x) | residual_1 | h_1(x) |        F1(x)
======================================================================
A1 | B1 | C1 | 50 | 50    |    0       |  0.5   | 50 + (-1.11) * 0.5
A2 | B2 | C2 | 70 | 50    |    20      |  18    | 50 + (-1.11) * 18
A3 | B3 | C3 | 30 | 50    |    -20     |  -18   | 50 + (-1.11) * (-18)
``````

## Output the final model

After constructing `M` base learners for pseudo-residuals, we finally come up with our final model, that is `FM(x)`.

Tags: