Imagine that you have some data x1, x2, x3, ..., xn originating from an unknown continuous distribution f. You'd like to estimate f based on these samples.
There are several approaches to this task. One of them is to estimate f with the empirical distribution. However, if the data is continuous, the empirical distribution will look close to a uniform distribution over the samples, since each data point most likely appears only once and therefore has the same probability of occurring.
Another approach is to bucket the data into intervals, as in a histogram. However, this yields a discrete distribution over a finite number of buckets rather than a continuous distribution.
The next approach is to leverage what's called Parzen Window Density Estimation, a nonparametric method for estimating the probability density function from the data. It's just another name for Kernel Density Estimation (KDE).
Basically, KDE estimates the true distribution with a mixture of kernels that are centered at the data points xi and have bandwidth (scale) equal to h. Here's the formula of KDE:
f_hat(x) = (1 / (n * h)) * Σ_{i=1}^{n} K((x - xi) / h)

where:
- K: the kernel function
- n: the number of samples
- h: the bandwidth
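To make the formula concrete, here is a minimal sketch that evaluates f_hat at a single point with a Gaussian kernel. The sample values and the bandwidth are made up purely for illustration:

```python
import numpy as np

def gaussian_kernel(u):
    # Standard Gaussian PDF, one common choice of K.
    return np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)

def f_hat(x, samples, h):
    # Direct translation of the formula:
    # f_hat(x) = (1 / (n * h)) * sum_i K((x - xi) / h)
    n = len(samples)
    return np.sum(gaussian_kernel((x - samples) / h)) / (n * h)

samples = np.array([1.2, 1.9, 2.1, 2.8, 3.5])  # hypothetical data x1..xn
print(f_hat(2.0, samples, h=0.5))
```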
A kernel can be described as a probability density function that is used to estimate the density of data located close to a data point xi. Any function can serve as a kernel as long as it fulfills the required properties: the total probability (area under the PDF) is one, the function is symmetric (K(-u) = K(u)), and the function is non-negative. Several kernel examples are Gaussian, uniform, triangular, Epanechnikov, biweight, triweight, and so forth. If you browse them, you may notice that most of the kernel functions have a support of -1 to 1 (when the function is centered at 0).
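As a quick illustration, here is a small sketch (assuming NumPy and SciPy are available) that defines two of the kernels named above and numerically checks two of the properties; non-negativity is visible directly from the definitions:

```python
import numpy as np
from scipy.integrate import quad

def epanechnikov(u):
    # K(u) = 0.75 * (1 - u^2) on [-1, 1], zero elsewhere; never negative.
    return np.where(np.abs(u) <= 1, 0.75 * (1 - u**2), 0.0)

def triangular(u):
    # K(u) = 1 - |u| on [-1, 1], zero elsewhere; never negative.
    return np.where(np.abs(u) <= 1, 1 - np.abs(u), 0.0)

for name, K in [("epanechnikov", epanechnikov), ("triangular", triangular)]:
    area, _ = quad(K, -1, 1)  # should integrate to one
    u = np.linspace(-1, 1, 101)
    symmetric = np.allclose(K(u), K(-u))  # K(-u) = K(u)
    print(name, round(area, 6), symmetric)
```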
So, why do we need a kernel and a bandwidth in the first place?
Since the data point xi is just a sample coming from a continuous distribution f, we can presume that there is some unknown (nonzero) density around xi. In other words, there might be data points from the population that are very close to xi. These nearby data points are represented with a density function called the kernel.
But how close must a data point be to xi to count? The term "close" is relative; it is determined by the bandwidth h. The following paragraphs should make this a bit clearer.
If you look at the formula of f_hat(x), the term K((x - xi) / h) denotes that the kernel function is centered at xi, scaled by h, and evaluated at the point x. As mentioned in the previous paragraphs, most kernel functions (centered at 0) have a support from -1 to 1. If the kernel function is centered at 0, its minimum and maximum inputs are -1 and 1 respectively. If the kernel function is instead centered at xi, the minimum and maximum become xi - 1 and xi + 1 respectively. The same concept applies when a bandwidth h is introduced. The calculation goes as follows:
- min value: (x - xi) / h = -1, which yields x_min = xi - h
- max value: (x - xi) / h = 1, which yields x_max = xi + h
In other words, the bandwidth h stretches the support of the kernel function to [xi - h, xi + h].
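A quick sketch can confirm this. With xi = 2 and h = 0.5 (arbitrary illustrative values), an Epanechnikov kernel scaled this way is nonzero only inside [1.5, 2.5]:

```python
import numpy as np

def epanechnikov(u):
    # Zero outside [-1, 1], so K((x - xi) / h) is zero outside [xi - h, xi + h].
    return np.where(np.abs(u) <= 1, 0.75 * (1 - u**2), 0.0)

xi, h = 2.0, 0.5
for x in [1.4, 1.75, 2.0, 2.25, 2.6]:
    print(x, epanechnikov((x - xi) / h))
# Only x inside [xi - h, xi + h] = [1.5, 2.5] gets a nonzero density.
```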
Given the formula of f_hat(x), we can estimate the probability density of any point x with the following steps (see the sketch after this list):
- Determine the bandwidth h that will be used.
- For each sample data point xi, calculate the probability density for x using the kernel function centered at xi. Specifically, calculate K((x - xi) / h). Let's denote the result as Ki.
- Sum up Ki for i = 1 to n (n is the number of sample data points), then divide the sum by n * h to obtain f_hat(x).
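Putting the steps together, here is a minimal sketch of the whole procedure. The kernel, bandwidth, and data below are illustrative choices, not prescriptions:

```python
import numpy as np

def epanechnikov(u):
    # One valid kernel choice; zero outside [-1, 1].
    return np.where(np.abs(u) <= 1, 0.75 * (1 - u**2), 0.0)

def kde(x, samples, h, K=epanechnikov):
    samples = np.asarray(samples)
    # Step 2: for each sample xi, compute Ki = K((x - xi) / h).
    Ki = K((x - samples) / h)
    # Step 3: sum up Ki for i = 1..n, then divide by n * h.
    return Ki.sum() / (len(samples) * h)

# Step 1: determine the bandwidth h (an arbitrary illustrative choice).
h = 0.5
samples = [1.2, 1.9, 2.1, 2.8, 3.5]  # hypothetical data
for x in np.linspace(0.0, 5.0, 6):
    print(f"f_hat({x:.1f}) = {kde(x, samples, h):.4f}")
```

Evaluating kde over a grid of x values like this traces out the estimated density curve, which is how KDE plots are typically produced.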