Let \(X_1,X_2,\ldots,X_n\) be independent and identically distributed with common mean \(\mu\) and common variance \(\sigma^2\); that is, \(E(X_i)=\mu\) and \(\mathrm{Var}(X_i)=\sigma^2\) for \(i=1,2,\ldots,n\).
Then we can calculate the sum of these \(X_i\)’s and their sample mean \(\overline X_n\): \[S_n=\sum_{i=1}^n X_i\] \[\overline X_n=\frac1n \sum_{i=1}^n X_i\]
The central limit theorem states that as long as the sample size \(n\) is large enough, we can approximate the distributions of \(\overline X_n\) and \(S_n\) by normal distributions:
\[\overline X_n \sim N\left(\mu, \frac{\sigma^2}{n}\right)\] \[S_n \sim N\left(n\mu, n\sigma^2\right)\]
If the \(X_i\) are truly normally distributed, then the distributions above are exact and the central limit theorem is no longer an approximation.
As long as the \(X_i\) data are not too skewed and do not have an extremely high probability of large outliers, the CLT approximation is fairly good. A general rule of thumb is: if \(n\geq 30\), then the CLT approximation can be used. If the \(X_i\) data are extremely badly skewed, then a sample size in the hundreds or thousands may be required before the CLT approximation is good.
Since the CLT says that \(\overline X_n \sim N\left(\mu, \frac{\sigma^2}{n}\right)\) and \(S_n \sim N\left(n\mu, n\sigma^2\right)\), we can standardize these two variables, using the fact that \[\text{if } X\sim N(\mu,\sigma^2), \ \text{ then } \ \frac{X-\mu}{\sigma}\sim N(0,1).\] Thus
\[\frac{\overline X_n - \mu}{\frac{\sigma}{\sqrt n}} \sim N\left(0,1\right)\] \[\frac{S_n-n\mu}{\sigma\sqrt{n}} \sim N\left(0,1\right)\]
We will use this fact in the simulations that follow. Since increasing \(n\) will decrease the variance of \(\overline X_n\), it will make plotting the histogram difficult as it will become skinnier and taller. However, standardizing will keep this from happening since the goal is for the histogram to always approximate a standard normal pdf.
Here we will explore the CLT with simulations in a few different ways.
First we assume the data is normally distributed: \(X_i\sim N(\mu,\sigma^2)\).
Here is the process that occurs in the simulation code that follows:
General behavior of the following simulations:

- Decreasing Nsims will make the histogram more jagged since it will be constructed from a smaller dataset of \(\overline X\)'s.
- Increasing Nsims will make the histogram fit the plotted probability density functions better.
- Increasing n will not have any effect on how well the histogram fits the pdf from the CLT when the underlying data is already normally distributed, but if the data is not standardized, increasing n will make the histogram and pdf get very skinny and tall. If the data is standardized before plotting the histogram, increasing n will not make the plots skinny and tall.
- Increasing n will make the histogram fit the plotted probability density functions better, but it may still be quite jagged unless Nsims is big enough.

In this simulation, we start with a sample size of \(n=2\), thus we are only averaging two \(X\)-values to get a sample mean. Ten thousand such sample means are generated and a histogram of them is compared with the normal pdf predicted by the CLT.
Try the following:

- Decreasing Nsims and then increasing it to see the histogram become more and less jagged. (Note: if you make Nsims too large, the simulation will take too long and will be terminated early. The server that is called upon to do the calculation only allows 10 seconds maximum computer time per simulation. I have found that Nsims \(< 100{,}000 = 10^5\) is usually fine.)
R code for central limit theorem with normal data, non-standardized:
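The original code chunk is not reproduced here, but a minimal R sketch of such a simulation might look like the following (the choices mu = 0, sigma = 1, n = 2, and Nsims = 10^4 are illustrative, not taken from the original):

```r
# Simulate Nsims sample means of n normal observations and compare the
# histogram to the normal pdf predicted by the CLT.
Nsims <- 10^4     # number of simulated sample means
n     <- 2        # sample size used for each mean
mu    <- 0        # assumed mean of the X_i
sigma <- 1        # assumed standard deviation of the X_i

xbar <- replicate(Nsims, mean(rnorm(n, mean = mu, sd = sigma)))

hist(xbar, breaks = 50, freq = FALSE,
     main = "Sample means of normal data (non-standardized)")
curve(dnorm(x, mean = mu, sd = sigma / sqrt(n)),
      add = TRUE, col = "red", lwd = 2)   # CLT pdf: N(mu, sigma^2/n)
```

The red curve is the \(N(\mu,\sigma^2/n)\) pdf from the CLT; since the underlying data here are normal, it matches the histogram even at \(n=2\).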
In this simulation, we again start with a sample size of \(n=2\). Ten thousand such sample means are generated and a histogram of them is compared with the normal pdf predicted by the CLT. This time we standardize the data and plot it against a standard normal pdf for comparison.
\[\frac{\overline X_n-\mu}{\frac{\sigma}{\sqrt n}} = Z\sim N(0,1)\]
Try the following:

- Decreasing Nsims and then increasing it to see the histogram become more and less jagged.

R code for central limit theorem with normal data, standardized:
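A minimal R sketch of the standardized version might look like this (again with illustrative values mu = 0, sigma = 1, n = 2, Nsims = 10^4):

```r
# Standardize each sample mean and compare the histogram to the
# standard normal pdf.
Nsims <- 10^4
n     <- 2
mu    <- 0
sigma <- 1

xbar <- replicate(Nsims, mean(rnorm(n, mean = mu, sd = sigma)))
z    <- (xbar - mu) / (sigma / sqrt(n))   # standardized sample means

hist(z, breaks = 50, freq = FALSE,
     main = "Standardized sample means of normal data")
curve(dnorm(x), add = TRUE, col = "red", lwd = 2)   # standard normal pdf
```

Because the means are standardized, increasing n does not make the histogram skinnier and taller; it always sits near the standard normal pdf.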
Now we investigate the CLT when the underlying data is not normally distributed. Here we will let the \(X_i\) be exponentially distributed, which means the normal approximation from the CLT will not be as good a fit.
In the following simulations \(X_i\sim Exp(\lambda)\) with \(E(X_i)=\frac1\lambda\) and \(Var(X_i)=\frac1{\lambda^2}\).
Thus \[\overline X_n \ \text{ is approximately } \ N\left(\frac1\lambda,\frac1{\lambda^2 n}\right)\]
In this simulation, we again start with a sample size of \(n=2\), thus we are only averaging two \(X\)-values to get a sample mean. Ten thousand such sample means are generated and a histogram of them is compared with the normal pdf predicted by the CLT.
Try the following:

- Decreasing Nsims and then increasing it to see the histogram become more and less jagged.

R code for central limit theorem with exponential data, non-standardized:
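A minimal R sketch for exponential data might look like this (lambda = 1, n = 2, and Nsims = 10^4 are illustrative choices):

```r
# Simulate sample means of exponential data and overlay the CLT pdf
# N(1/lambda, 1/(lambda^2 * n)).
Nsims  <- 10^4
n      <- 2
lambda <- 1

xbar <- replicate(Nsims, mean(rexp(n, rate = lambda)))

hist(xbar, breaks = 50, freq = FALSE,
     main = "Sample means of exponential data (non-standardized)")
curve(dnorm(x, mean = 1 / lambda, sd = 1 / (lambda * sqrt(n))),
      add = TRUE, col = "red", lwd = 2)
```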
Now we do the same simulation as above for the exponential distribution, but we standardize the output: \[\frac{\overline X_n-\frac{1}{\lambda}}{\frac{1}{\lambda\sqrt n}} \approx Z\sim N(0,1)\]
In this simulation, look closely at the tails of the distribution. Even with \(n\) somewhat large, in the hundreds, you will see a deviation from the normal approximation that did not occur in the previous simulations with normal data. Try this simulation with \(n=100\) and Nsims=10^5
and compare it to the above standardized normal one with the same parameters.
R code for central limit theorem with exponential data, standardized:
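A minimal R sketch of the standardized exponential simulation, using the \(n=100\) and Nsims = 10^5 settings suggested above (lambda = 1 is an illustrative choice), might look like:

```r
# Standardize the exponential sample means and compare to N(0,1);
# look closely at the tails of the histogram.
Nsims  <- 10^5
n      <- 100
lambda <- 1

xbar <- replicate(Nsims, mean(rexp(n, rate = lambda)))
z    <- (xbar - 1 / lambda) / (1 / (lambda * sqrt(n)))   # standardized means

hist(z, breaks = 100, freq = FALSE,
     main = "Standardized sample means of exponential data")
curve(dnorm(x), add = TRUE, col = "red", lwd = 2)   # standard normal pdf
```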
Here we are again looking at underlying normally distributed data. Instead of standardizing with \(\sigma\), we standardize with \(s\) and this will give us the \(t\)-distribution.
\[X_i \sim N(\mu,\sigma^2)\] \[X_1,X_2,\ldots,X_n \ \text{ are iid }\] With sample mean \[\overline X_n=\frac1n\sum_{i=1}^nX_i\] and sample standard deviation \[s_n=\sqrt{\frac{1}{n-1}\sum_{i=1}^n(X_i-\overline X_n)^2}\] Then \[\frac{\overline X_n-\mu}{\frac{s_n}{\sqrt n}}\sim T_{n-1}\] where \(T_{n-1}\) follows the Student’s \(t\)-distribution with \(n-1\) degrees of freedom.
In the following simulation, observe:

- Decreasing Nsims will make the histogram more jagged, and increasing it will make the histogram fit the \(t\)-distribution better.
- Try \(n=2,3,5,8,10,15,20,25,30\) and observe what happens.
R code for simulating normal data and showing that, when standardized with the sample standard deviation, the sample mean follows the \(t\)-distribution:
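A minimal R sketch might look like this (mu = 0, sigma = 1, and n = 5 are illustrative; try the values of \(n\) listed above):

```r
# Standardize each sample mean with the sample standard deviation s
# instead of sigma, then compare the histogram to the t pdf with n - 1
# degrees of freedom.
Nsims <- 10^4
n     <- 5
mu    <- 0
sigma <- 1

tstat <- replicate(Nsims, {
  x <- rnorm(n, mean = mu, sd = sigma)
  (mean(x) - mu) / (sd(x) / sqrt(n))   # sd() uses the n - 1 denominator
})

hist(tstat, breaks = 100, freq = FALSE, xlim = c(-6, 6),
     main = "Sample means standardized with s")
curve(dt(x, df = n - 1), add = TRUE, col = "red", lwd = 2)     # t pdf
curve(dnorm(x), add = TRUE, col = "blue", lwd = 2, lty = 2)    # N(0,1) for comparison
```

Overlaying the standard normal pdf (dashed) makes the heavier tails of the \(t\)-distribution visible for small \(n\); as \(n\) grows the two curves become nearly indistinguishable.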