# Hypothesis Testing with the Kruskal-Wallis Test

Published:

The Kruskal-Wallis test is a non-parametric statistical test that is used to evaluate whether the medians of two or more groups are different. Since the test is non-parametric, it doesn’t assume that the data comes from a particular distribution.

The test statistic calculated in this test is called the H statistic.

## Hypotheses

• H0: population medians are equal
• H1: population medians are unequal

## Assumptions

• Since this test is an extension of Mann-Whitney U Test, this test is commonly leveraged to evaluate differences between three or more groups
• The observations to evaluate should be in ordinal scale, ratio scale or interval scale
• The observations should be independent where there should be no relationship between the members in each group or between groups
• All groups should have the same shape distributions (number of peaks, symmetry, skewness)

## Test statistic formula

The general formula of the H statistic is the following (Source: Wikipedia - Kruskal–Wallis one-way analysis of variance):

``````H = (N - 1) *     SUM(i=1 to g) n_i * (r_i_hat - r_hat)^2
-----------------------------------------------
SUM(i=1 to g) SUM(j=1 to n_i) (r_i_j - r_hat)^2
``````

Where:

• `n_i`: the number of observations in group `i`
• `r_i_j`: the rank (among all observations) of observation `j` from group `i`
• `N`: the total number of observations across all groups
• `r_i_hat`: the average rank of all observations in group `i` which is given by `(1/n_i) * (SUM(j=1 to n_i) r_i_j)`
• `r_hat`: the average of all the `r_i_j` which is given by `0.5 * (N + 1)`

In addition, you might see that if the combined observations doesn’t consist of the same values, then the test statistic could be expressed alternatively as follow:

``````H = [[12 / (N(N + 1))] * SUM(i=1 to g) n_i * (r_i_hat)^2] - 3 * (N + 1)
``````

### Notice

I mentioned about how to prove the above formula (H statistic when there are no tied values) in this post.

## How to run the test

For the sake of clarity, let’s use the following example data for the demonstration:

``````Group A: 30, 40, 50, 60
Group B: 10, 20, 70, 80
Group C: 100, 200, 300, 400
``````

Step 1. Combine the observations from all the groups and sort in ascending order

``````Combined observations: 30, 40, 50, 60, 10, 20, 70, 80, 100, 200, 300, 400
Sorted observations: 10, 20, 30, 40, 50, 60, 70, 80, 100, 200, 300, 400
``````

Step 2. Assign ranks to the sorted observations

``````Sorted observations: 10, 20, 30, 40, 50, 60, 70, 80, 100, 200, 300, 400
Ranks: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12
``````

In case there are duplicate values, the rank is assigned by the following way:

• Assign normal ranks like the one in the above example
• Take the average of the ranks for the duplicate values

Here’s a simple example.

``````Sorted observations: 10, 10, 10, 30, 40, 40, 50
Normal ranks: 1, 2, 3, 4, 5, 6, 7
Averaged ranks: 2, 2, 2, 4, 5.5, 5.5, 7
``````

Step 3. Calculate the H statistic

Recall that the H statistic is given by the following:

``````H = (N - 1) *     SUM(i=1 to g) n_i * (r_i_hat - r_hat)^2
-----------------------------------------------
SUM(i=1 to g) SUM(j=1 to n_i) (r_i_j - r_hat)^2
``````

In this case, `g` is the number of groups which is 3 (Group A, B, and C).

Before executing the formula, let’s calculate the expressions within the formula first.

The first one is `r_i_hat` which is the average rank of all observations in group `i`. It equals to `(1/n_i) * (SUM(j=1 to n_i) r_i_j)`.

Applying the formula to our data yields the following.

``````r_i_j: the rank (among all observations) of observation j from group i

Group A
r_1_hat = (1/4) * (r_1_1 + r_1_2 + r_1_3 + r_1_4)
r_1_hat = (1/4) * (3 + 4 + 5 + 6) = 1/4 * 18 = 4.5

Group B
r_2_hat = (1/4) * (r_2_1 + r_2_2 + r_2_3 + r_2_4)
r_2_hat = (1/4) * (1 + 2 + 7 + 8) = 1/4 * 18 = 4.5

Group C
r_3_hat = (1/4) * (r_3_1 + r_3_2 + r_3_3 + r_3_4)
r_3_hat = (1/4) * (9 + 10 + 11 + 12) = 1/4 * 42 = 10.5
``````

The second one is `r_hat` which is the average of all the `r_i_j`. It equals to `0.5 * (N + 1)`.

Applying the formula to our data yields the following.

``````r_hat = 0.5 * (12 + 1) = 0.5 * 13 = 6.5
``````

With all the above results, let’s compute the H statistic.

``````H = (N - 1) *     SUM(i=1 to g) n_i * (r_i_hat - r_hat)^2
-----------------------------------------------
SUM(i=1 to g) SUM(j=1 to n_i) (r_i_j - r_hat)^2

H = (12 - 1) * [(4 * (4.5 - 6.5)^2) + (4 * (4.5 - 6.5)^2) + (4 * (10.5 - 6.5)^2)]
------------------------------------------------------------------
[(3 - 6.5)^2 + (4 - 6.5)^2 + ... + (11 - 6.5)^2 + (12 - 6.5)^2]

H = 11 *                             [(4 * 4) + (4 * 4) + (4 * 16)]
-----------------------------------------------------------------------------------------
[12.25 + 6.25 + 2.25 + 0.25 + 30.25 + 20.25 + 0.25 + 2.25 + 6.25 + 12.25 + 20.25 + 30.25]

H = 11 * (96 / 143)

H = 7.3843
``````

Step 4. State the conclusion.

After computing the H statistic, we compare the value to a critical value of chi-squared with `g - 1` degrees of freedom (`g` is the number of groups. In our example above, `g` is 3) and an alpha level. This critical value could be retrieved from the chi-squared probability distribution’s table. Let’s denote his critical value as `Hc`.

If `H` is bigger than `Hc`, then reject the null hypothesis. Otherwise, there is no evidence that the population medians are not equal.

• Wikipedia: Kruskal–Wallis one-way analysis of variance
• statisticshowto: Kruskal Wallis H Test: Definition, Examples & Assumptions