# Kruskal-Wallis Test Statistic Formula Derivation When No Tied Values Exist


In the previous post, I mentioned that the general formula of the H statistic is the following (Source: Wikipedia - Kruskal–Wallis one-way analysis of variance):

``````
              SUM(i=1 to g) n_i * (r_i_hat - r_hat)^2
H = (N - 1) * -----------------------------------------------
              SUM(i=1 to g) SUM(j=1 to n_i) (r_i_j - r_hat)^2
``````

Where:

• `g`: the number of groups
• `n_i`: the number of observations in group `i`
• `r_i_j`: the rank (among all observations) of observation `j` from group `i`
• `N`: the total number of observations across all groups
• `r_i_hat`: the average rank of all observations in group `i` which is given by `(1/n_i) * (SUM(j=1 to n_i) r_i_j)`
• `r_hat`: the average of all the `r_i_j` which is given by `0.5 * (N + 1)`
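To make the notation concrete, here is a minimal Python sketch that evaluates the general formula on a small made-up data set without ties. The function names `rank_all` and `h_general` and the sample values are illustrative assumptions, not something from the source of the formula. Exact fractions are used so that no rounding is involved.

```python
from fractions import Fraction

def rank_all(groups):
    """Replace each observation by its rank (1..N) among all observations.

    Assumes there are no tied values, so every rank is a distinct integer.
    """
    pooled = sorted(x for g in groups for x in g)
    rank_of = {x: r for r, x in enumerate(pooled, start=1)}
    return [[rank_of[x] for x in g] for g in groups]

def h_general(groups):
    """H statistic computed with the general formula."""
    ranks = rank_all(groups)
    N = sum(len(g) for g in ranks)
    r_hat = Fraction(N + 1, 2)  # average of all ranks
    # SUM n_i * (r_i_hat - r_hat)^2
    numerator = sum(
        len(g) * (Fraction(sum(g), len(g)) - r_hat) ** 2 for g in ranks
    )
    # SUM SUM (r_i_j - r_hat)^2
    denominator = sum((r - r_hat) ** 2 for g in ranks for r in g)
    return (N - 1) * numerator / denominator

# Hypothetical sample: g = 3 groups, N = 8 distinct values
sample = [[1.1, 2.4, 5.0], [3.2, 6.7], [4.8, 7.5, 9.9]]
print(rank_all(sample))   # [[1, 2, 5], [3, 6], [4, 7, 8]]
print(h_general(sample))  # 121/36
```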

In addition, if the combined observations contain no tied values (no two observations are equal), the test statistic can alternatively be expressed as follows:

``````
H = [[12 / (N * (N + 1))] * SUM(i=1 to g) n_i * (r_i_hat)^2] - 3 * (N + 1)    --> We're going to prove this
``````

In this post, we’re going to look at how to derive the above formula (H statistic without tied values).
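As a quick illustration of why this form is convenient, the sketch below evaluates it from rank sums and group sizes alone. The function name `h_shortcut` and the example ranks are assumptions made for illustration. The ranks are taken to be a tie-free assignment of `1..N` to the groups.

```python
from fractions import Fraction

def h_shortcut(rank_groups):
    """Tie-free H: [12 / (N * (N + 1))] * SUM n_i * (r_i_hat)^2 - 3 * (N + 1)."""
    N = sum(len(g) for g in rank_groups)
    # n_i * (r_i_hat)^2 is the same as (rank sum of group i)^2 / n_i
    weighted = sum(Fraction(sum(g) ** 2, len(g)) for g in rank_groups)
    return Fraction(12, N * (N + 1)) * weighted - 3 * (N + 1)

# Hypothetical ranks for N = 8 observations in g = 3 groups (each of 1..8 used once)
rank_groups = [[1, 2, 5], [3, 6], [4, 7, 8]]
h = h_shortcut(rank_groups)
print(h)  # 121/36, roughly 3.36
```

Only the per-group rank sums `8, 9, 19` and the sizes `3, 2, 3` enter the calculation, so no deviations from the overall mean rank have to be computed.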

## Start from the denominator

We’ll start from the denominator and see what it becomes when there are no tied values.

Expanding the square inside the denominator yields the following.

``````
SUM(i=1 to g) SUM(j=1 to n_i) (r_i_j - r_hat)^2

= SUM(i=1 to g) SUM(j=1 to n_i) [(r_i_j)^2 - 2*(r_i_j)*(r_hat) + (r_hat)^2]        (A)
``````

## Re-structure the formula for rank average

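Next, we’ll use the formula of `r_hat`, which is `(N + 1) / 2`. Without ties the ranks are exactly `1, 2, ..., N`, so their average is `(N + 1) / 2`. Writing `r_hat` as the sum of all ranks divided by `N` gives the following.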

``````
(N + 1) / 2 = [SUM(i=1 to g) SUM(j=1 to n_i) r_i_j] / N

N * (N + 1) / 2 = SUM(i=1 to g) SUM(j=1 to n_i) r_i_j            (B)
``````

## Express (A) in form of (B)

Next, we’ll split `(A)` into three separate sums so that each one can be simplified, using `(B)` where it applies.

``````
SUM(i=1 to g) SUM(j=1 to n_i) [(r_i_j)^2 - 2*(r_i_j)*(r_hat) + (r_hat)^2]

= [SUM(i=1 to g) SUM(j=1 to n_i) (r_i_j)^2] \
- [SUM(i=1 to g) SUM(j=1 to n_i) 2 * (r_i_j) * (r_hat)] \
+ [SUM(i=1 to g) SUM(j=1 to n_i) (r_hat)^2]                     (C)
``````

The term `[SUM(i=1 to g) SUM(j=1 to n_i) (r_hat)^2]` simply states that `(r_hat)^2` appears `N` times. Therefore, the term becomes `N * (r_hat)^2` or `N * (N + 1)^2 / 4`.

Meanwhile, in the term `[SUM(i=1 to g) SUM(j=1 to n_i) 2 * (r_i_j) * (r_hat)]` the factor `2 * r_hat` is constant, so it can be pulled out of the sum. Using `(B)` for the remaining sum of ranks, the term becomes `2 * r_hat * N * (N + 1) / 2` or `0.5 * N * (N + 1)^2`.

Last but not least, because there are no ties the ranks are exactly `1, 2, ..., N`, so the term `[SUM(i=1 to g) SUM(j=1 to n_i) (r_i_j)^2]` has the form `1^2 + 2^2 + 3^2 + ... + n^2`. Recall that such a sum of squares can be expressed as `n * (n + 1) * (2n + 1) / 6`, here with `n = N`.

Therefore, `(C)` can be expressed as the following.

``````
[N * (N + 1) * (2N + 1) / 6] - [0.5 * N * (N + 1)^2] + [N * (N + 1)^2 / 4]
``````

Putting the three terms over the common denominator 12 and factoring out `N + 1` yields the following.

``````
{ [2 * N * (N + 1) * (2N + 1)] - [6 * N * (N + 1)^2] + [3 * N * (N + 1)^2] } / 12

= { [N + 1] * [2N(2N+1) - 6N(N+1) + 3N(N+1)] } / 12

= { [N + 1] * [4N^2 + 2N - 6N^2 - 6N + 3N^2 + 3N] } / 12

= { [N + 1] * [N^2 - N] } / 12

= { [N + 1] * N * [N - 1] } / 12           (D)
``````

In summary, when there are no tied values the denominator equals `(D)`.
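The simplifications above can be checked by brute force. The following sketch (the helper name `denominator_terms` is an illustrative assumption) computes the three sums in `(C)` directly from the ranks `1..N` and compares them with the closed forms, including `(D)`.

```python
from fractions import Fraction

def denominator_terms(N):
    """Return the three sums in (C) for tie-free ranks 1..N, computed directly."""
    ranks = range(1, N + 1)
    r_hat = Fraction(N + 1, 2)
    squares = sum(r * r for r in ranks)          # SUM (r_i_j)^2
    cross = sum(2 * r * r_hat for r in ranks)    # SUM 2 * r_i_j * r_hat
    constant = sum(r_hat ** 2 for _ in ranks)    # SUM (r_hat)^2
    return squares, cross, constant

for N in range(2, 51):
    squares, cross, constant = denominator_terms(N)
    assert squares == Fraction(N * (N + 1) * (2 * N + 1), 6)
    assert cross == Fraction(N * (N + 1) ** 2, 2)
    assert constant == Fraction(N * (N + 1) ** 2, 4)
    # (D): the whole denominator collapses to (N + 1) * N * (N - 1) / 12
    assert squares - cross + constant == Fraction((N + 1) * N * (N - 1), 12)

print(denominator_terms(8))  # the three sums for N = 8: 204, 324 and 162
```

For `N = 8` the three sums combine to `204 - 324 + 162 = 42`, which matches `9 * 8 * 7 / 12`. How the ranks are distributed over the groups does not matter here, because the double sum simply runs over every rank once.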

## Plug in (D) to the H statistic formula

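Let’s plug `(D)` into the H statistic formula. The factor `(N - 1)` in front cancels with the `[N - 1]` in the denominator, and the `12` moves to the numerator.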

``````
              SUM(i=1 to g) n_i * (r_i_hat - r_hat)^2
H = (N - 1) * ---------------------------------------
              { [N + 1] * N * [N - 1] } / 12

         SUM(i=1 to g) n_i * (r_i_hat - r_hat)^2
H = 12 * ---------------------------------------
         { [N + 1] * N }
``````

Continuing the process yields the following.

``````
H = [12 / (N(N+1))] * [SUM(i=1 to g) n_i * (r_i_hat - r_hat)^2]

H = [12 / (N(N+1))] * [SUM(i=1 to g) n_i * {(r_i_hat)^2 - 2*r_i_hat*r_hat + (r_hat)^2}]

H = [12 / (N(N+1))] * [SUM(i=1 to g) {n_i * (r_i_hat)^2 - (n_i * 2 * r_i_hat * r_hat) + n_i * (r_hat)^2}]

H = [12 / (N(N+1))] * [{SUM(i=1 to g) n_i * (r_i_hat)^2} - {SUM(i=1 to g) n_i * 2 * r_i_hat * r_hat} + {SUM(i=1 to g) n_i * (r_hat)^2}]           (E)
``````

## Express (E) in form of N (the total number of observations across all groups)

Recall the following before proceeding to the next step:

(X) The formula of `r_i_hat` is `(1/n_i) * (SUM(j=1 to n_i) r_i_j)`
(Y) The formula of `r_hat` is `(N + 1) / 2`
(Z) The formula from `(B)` is `N * (N + 1) / 2 = SUM(i=1 to g) SUM(j=1 to n_i) r_i_j`

We’ll expand the following two terms of `(E)` so that they are expressed in terms of `N`:

(P) `SUM(i=1 to g) n_i * 2 * r_i_hat * r_hat`
(Q) `SUM(i=1 to g) n_i * (r_hat)^2`

Now let’s take a look at `(P)` first.

``````
SUM(i=1 to g) n_i * 2 * r_i_hat * r_hat

Using (X) and (Y) to replace r_i_hat and r_hat yields the following:

= SUM(i=1 to g) n_i * 2 * (1/n_i) * (SUM(j=1 to n_i) r_i_j) * (N + 1) / 2

= SUM(i=1 to g) (SUM(j=1 to n_i) r_i_j) * (N + 1)

= (N + 1) * SUM(i=1 to g) (SUM(j=1 to n_i) r_i_j)

Using (Z), the above can be expressed as follows:

= (N + 1) * N * (N + 1) / 2

= N * (N + 1)^2 / 2           (F)
``````

Next, let’s take a look at `(Q)`.

``````
SUM(i=1 to g) n_i * (r_hat)^2

Using (Y) to replace r_hat yields the following:

= SUM(i=1 to g) n_i * (N + 1)^2 / 4

= [(N + 1)^2 / 4] * SUM(i=1 to g) n_i

Note that SUM(i=1 to g) n_i is simply N, the total number of observations (combined groups).

Therefore, we get:

= N * (N + 1)^2 / 4          (G)
``````
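Both results hold for any way of dividing the ranks among the groups. The sketch below checks `(F)` and `(G)` on one random split of the ranks `1..20` into four groups. The helper `split_ranks`, the seed, and the sizes are arbitrary assumptions chosen for illustration.

```python
import random
from fractions import Fraction

def split_ranks(N, g, seed=0):
    """Randomly split the tie-free ranks 1..N into g non-empty groups."""
    rng = random.Random(seed)
    ranks = list(range(1, N + 1))
    rng.shuffle(ranks)
    cuts = sorted(rng.sample(range(1, N), g - 1))  # g - 1 distinct cut points
    bounds = [0] + cuts + [N]
    return [ranks[a:b] for a, b in zip(bounds, bounds[1:])]

N, g = 20, 4
groups = split_ranks(N, g)
r_hat = Fraction(N + 1, 2)

# (P): SUM n_i * 2 * r_i_hat * r_hat
P = sum(len(grp) * 2 * Fraction(sum(grp), len(grp)) * r_hat for grp in groups)
# (Q): SUM n_i * (r_hat)^2
Q = sum(len(grp) * r_hat ** 2 for grp in groups)

assert P == Fraction(N * (N + 1) ** 2, 2)  # (F)
assert Q == Fraction(N * (N + 1) ** 2, 4)  # (G)
print(P, Q)  # 4410 2205
```

Changing the seed changes the group sizes and contents but not `P` or `Q`, which is what lets both terms be written purely in terms of `N`.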

## Plug in (F) and (G) to (E)

Finally, we plug `(F)` and `(G)` into `(E)`.

``````
H = [12 / (N(N+1))] * [{SUM(i=1 to g) n_i * (r_i_hat)^2} - {N * (N + 1)^2 / 2} + {N * (N + 1)^2 / 4}]

H = [12 / (N(N+1))] * [{SUM(i=1 to g) n_i * (r_i_hat)^2} - {N * (N + 1)^2 / 4}]

H = [12 / (N(N+1))] * [{SUM(i=1 to g) n_i * (r_i_hat)^2}] - [12 / (N * (N + 1))] * [{N * (N + 1)^2 / 4}]

H = [[12 / (N * (N + 1))] * [{SUM(i=1 to g) n_i * (r_i_hat)^2}]] - [3 * (N + 1)]
``````

This is exactly the alternative formula stated at the beginning, so we're done.
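The derivation relied on the ranks being exactly `1, 2, ..., N`. The sketch below illustrates what happens otherwise. It assumes the common convention of giving tied values the average of the positions they occupy (mid-ranks). The function names and both data sets are made up for illustration.

```python
from fractions import Fraction

def midranks(groups):
    """Rank all observations together; tied values share the average of their positions."""
    pooled = sorted(x for g in groups for x in g)
    positions = {}
    for pos, x in enumerate(pooled, start=1):
        positions.setdefault(x, []).append(pos)
    rank_of = {x: Fraction(sum(p), len(p)) for x, p in positions.items()}
    return [[rank_of[x] for x in g] for g in groups]

def h_general(rank_groups):
    """General formula: (N - 1) * between-group sum / total sum of squares."""
    N = sum(len(g) for g in rank_groups)
    r_hat = Fraction(N + 1, 2)
    num = sum(len(g) * (sum(g) / len(g) - r_hat) ** 2 for g in rank_groups)
    den = sum((r - r_hat) ** 2 for g in rank_groups for r in g)
    return (N - 1) * num / den

def h_shortcut(rank_groups):
    """Shortcut formula derived in this post (assumes no ties)."""
    N = sum(len(g) for g in rank_groups)
    weighted = sum(sum(g) ** 2 / len(g) for g in rank_groups)
    return Fraction(12, N * (N + 1)) * weighted - 3 * (N + 1)

no_ties = [[1, 2, 5], [3, 6], [4, 7, 8]]
with_ties = [[1, 2, 5], [3, 5], [4, 7, 8]]  # the value 5 appears twice

for data in (no_ties, with_ties):
    ranks = midranks(data)
    print(h_general(ranks), h_shortcut(ranks))
# no ties:   121/36   121/36
# with ties: 3115/996 445/144
```

Without ties the two formulas agree. With the tie, the sum of squared ranks is smaller than `N * (N + 1) * (2N + 1) / 6`, so the denominator is no longer `(D)` and the shortcut no longer matches the general formula.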
