1 Central limit theorem

Let \(X_1,X_2,\ldots,X_n\) be independent and identically distributed with common mean \(\mu\) and common variance \(\sigma^2\). That is, \(E(X_i)=\mu\) and Var\((X_i)=\sigma^2\) for \(i=1,2,\ldots,n\).

Then we can calculate the sum of these \(X_i\)’s and their sample mean \(\overline X_n\): \[S_n=\sum_{i=1}^n X_i\] \[\overline X_n=\frac1n \sum_{i=1}^n X_i\]

The central limit theorem states that as long as the sample size \(n\) is large enough, we can approximate the distributions of \(\overline X_n\) and \(S_n\) by normal distributions:

\[\overline X_n \sim N\left(\mu, \frac{\sigma^2}{n}\right)\] \[S_n \sim N\left(n\mu, n\sigma^2\right)\]
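
For example (values chosen purely for illustration), if \(\mu=10\), \(\sigma=2\), and \(n=25\), then \(\overline X_{25}\) is approximately \(N\left(10,\tfrac{4}{25}\right)\) and \(S_{25}\) is approximately \(N(250,100)\).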

If the \(X_i\) are truly normally distributed, then these distributions are exact and the central limit theorem is no longer an approximation.

As long as the \(X_i\) data are not too skewed and do not have an extremely high probability of large outliers, the CLT approximation is fairly good. A general rule of thumb is that if \(n\geq 30\), the CLT approximation can be used. If the \(X_i\) data are extremely badly skewed, then a sample size in the hundreds or thousands may be required to make the CLT approximation good.

1.1 Standardized \(\overline X_n\) and \(S_n\)

Since the CLT says that \(\overline X_n \sim N\left(\mu, \frac{\sigma^2}{n}\right)\) and \(S_n \sim N\left(n\mu, n\sigma^2\right)\), we can standardize these two variables. Recall that if \[X\sim N(\mu,\sigma^2), \ \text{ then } \ \frac{X-\mu}{\sigma}\sim N(0,1).\] Thus

\[\frac{\overline X_n - \mu}{\frac{\sigma}{\sqrt n}} \sim N\left(0,1\right)\] \[\frac{S_n-n\mu}{\sigma\sqrt{n}} \sim N\left(0,1\right)\]

We will use this fact in the simulations that follow. Since increasing \(n\) decreases the variance of \(\overline X_n\), the histogram of the sample means becomes skinnier and taller, which makes plotting difficult. Standardizing avoids this, since the standardized histogram always approximates a standard normal pdf regardless of \(n\).

Here we will explore the CLT with simulations in a few different ways.

2 Central limit theorem for normal data

First we assume the data is normally distributed: \(X_i\sim N(\mu,\sigma^2)\).

Here is the process that occurs in the simulation code that follows:

  1. Simulate a random sample of \(X\) data: \(X_1,X_2,\ldots,X_n\).
  2. Calculate the sample average \(\overline X_n\).
  3. Repeat this many many times to get a collection of sample means.
  4. Plot a histogram of the sample means.
  5. Compare this histogram to that predicted by the CLT.

General behavior of the following simulations:

  • Decreasing Nsims will make the histogram more jagged since it will be constructed from a smaller dataset of \(\overline X\)’s.
  • Increasing Nsims will make the histogram fit the plotted probability density functions better.
  • Changing n has no effect on how well the histogram fits the pdf from the CLT when the underlying data is already normally distributed. However, if the data is not standardized, increasing n will make the histogram and pdf get very skinny and tall; if the data is standardized before plotting, increasing n will not make the plots skinny and tall.
  • If the underlying data is not normally distributed, then increasing n will make the histogram fit the plotted probability density functions better, but it may still be quite jagged unless Nsims is big enough.

2.1 R simulation of CLT (normal data)

In this simulation, we start with a sample size of \(n=2\), thus we are only averaging two \(X\)-values to get a sample mean. Ten thousand such sample means are generated and a histogram of them is compared with the normal pdf predicted by the CLT.

Try the following:

  1. Increase \(n\), try \(n=2,5,10,20,50,100,1000\).
  2. Observe what happens in the outputted graphs.
  3. Do you understand how this relates to the CLT?
  4. Try decreasing Nsims and increasing it to see the histogram become more and less jagged.

(Note: If you make Nsims too large, the simulation will take too long and will be terminated early. The server that does the calculation allows at most 10 seconds of computing time per simulation. I have found that Nsims < 10^5 = 100,000 is usually fine.)

R code for central limit theorem with normal data, non-standardized:
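
A minimal sketch of this simulation in base R follows (the values mu = 0 and sigma = 1 are illustrative assumptions, as are the plotting choices; n and Nsims are set to the starting values described above):

    # Parameters (illustrative values; change n and Nsims to experiment)
    mu    <- 0      # common mean of the X_i (assumed value)
    sigma <- 1      # common standard deviation of the X_i (assumed value)
    n     <- 2      # sample size being averaged
    Nsims <- 10^4   # number of sample means to simulate

    # Each replication draws n iid N(mu, sigma^2) values and takes their mean
    xbars <- replicate(Nsims, mean(rnorm(n, mean = mu, sd = sigma)))

    # Histogram of the sample means on the density scale
    hist(xbars, breaks = 50, freq = FALSE,
         main = "Sample means of normal data",
         xlab = expression(bar(X)[n]))

    # Overlay the N(mu, sigma^2 / n) pdf predicted by the CLT
    curve(dnorm(x, mean = mu, sd = sigma / sqrt(n)),
          add = TRUE, col = "red", lwd = 2)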

2.2 R simulation of CLT (normal data, standardized to \(Z\))

In this simulation, we again start with a sample size of \(n=2\). Ten thousand such sample means are generated and a histogram of them is compared with the normal pdf predicted by the CLT. This time we standardize the data and plot it against a standard normal pdf for comparison.

\[\frac{\overline X_n-\mu}{\frac{\sigma}{\sqrt n}} = Z\sim N(0,1)\]

Try the following:

  1. Increase \(n\), try \(n=2,5,10,20,50,100,1000\).
  2. Observe what happens in the outputted graphs.
  3. Note that the graphs will not become skinny and tall since the data is standardized.
  4. Do you understand how this relates to the CLT?
  5. Try decreasing Nsims and increasing it to see the histogram become more and less jagged.

R code for central limit theorem with normal data, standardized:
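
A minimal sketch in base R (same illustrative mu = 0 and sigma = 1 as before; the standardization follows the formula above):

    mu    <- 0      # assumed value
    sigma <- 1      # assumed value
    n     <- 2
    Nsims <- 10^4

    # Standardize each sample mean: Z = (xbar - mu) / (sigma / sqrt(n))
    zs <- replicate(Nsims,
                    (mean(rnorm(n, mean = mu, sd = sigma)) - mu) / (sigma / sqrt(n)))

    hist(zs, breaks = 50, freq = FALSE,
         main = "Standardized sample means of normal data",
         xlab = "Z")

    # Overlay the standard normal pdf; this no longer depends on n
    curve(dnorm(x), add = TRUE, col = "red", lwd = 2)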

3 Central limit theorem for exponential data

Now we investigate the CLT when the underlying data is not normally distributed. Here we will let the \(X_i\) be exponentially distributed. This means that the normal approximation from the CLT will not be as good a fit.

In the following simulations \(X_i\sim Exp(\lambda)\) with \(E(X_i)=\frac1\lambda\) and \(Var(X_i)=\frac1{\lambda^2}\).

Thus \[\overline X_n \ \text{ is approximately } \ N\left(\frac1\lambda,\frac1{\lambda^2 n}\right)\]

3.1 R simulation of CLT (exponential data)

In this simulation, we again start with a sample size of \(n=2\), thus we are only averaging two \(X\)-values to get a sample mean. Ten thousand such sample means are generated and a histogram of them is compared with the normal pdf predicted by the CLT.

Try the following:

  1. Increase \(n\), try \(n=2,5,10,20,50,100,1000\).
  2. Observe what happens in the outputted graphs.
  3. Do you understand how this relates to the CLT?
  4. Try decreasing Nsims and increasing it to see the histogram become more and less jagged.

R code for central limit theorem with exponential data, non-standardized:
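
A minimal sketch in base R (lambda = 1 is an illustrative assumption):

    lambda <- 1     # rate of the exponential distribution (assumed value)
    n      <- 2
    Nsims  <- 10^4

    # Each replication draws n iid Exp(lambda) values and takes their mean
    xbars <- replicate(Nsims, mean(rexp(n, rate = lambda)))

    hist(xbars, breaks = 50, freq = FALSE,
         main = "Sample means of exponential data",
         xlab = expression(bar(X)[n]))

    # Overlay the approximate N(1/lambda, 1/(lambda^2 * n)) pdf from the CLT
    curve(dnorm(x, mean = 1 / lambda, sd = 1 / (lambda * sqrt(n))),
          add = TRUE, col = "red", lwd = 2)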

3.2 R simulation of CLT (exponential data, standardized to \(Z\))

Now we do the same simulation as above for the exponential distribution, but we standardize the output: \[\frac{\overline X_n-\frac{1}{\lambda}}{\frac{1}{\lambda\sqrt n}} \approx Z\sim N(0,1)\]

In this simulation, look closely at the tails of the distribution. Even with \(n\) somewhat large, in the hundreds, you will see a deviation from the normal approximation that did not occur in the previous simulations with normal data. Try this simulation with \(n=100\) and Nsims=10^5 and compare it to the above standardized normal one with the same parameters.

R code for central limit theorem with exponential data, standardized:
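
A minimal sketch in base R (again with the illustrative lambda = 1):

    lambda <- 1     # assumed value
    n      <- 2
    Nsims  <- 10^4

    # Standardize: Z = (xbar - 1/lambda) / (1 / (lambda * sqrt(n)))
    zs <- replicate(Nsims,
                    (mean(rexp(n, rate = lambda)) - 1 / lambda) * lambda * sqrt(n))

    hist(zs, breaks = 50, freq = FALSE,
         main = "Standardized sample means of exponential data",
         xlab = "Z")

    # Overlay the standard normal pdf; watch the fit in the tails
    curve(dnorm(x), add = TRUE, col = "red", lwd = 2)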

4 The Student’s \(t\)-distribution

Here we are again looking at underlying normally distributed data. Instead of standardizing with \(\sigma\), we standardize with \(s\) and this will give us the \(t\)-distribution.

\[X_i \sim N(\mu,\sigma^2)\] \[X_1,X_2,\ldots,X_n \ \text{ are iid }\] With sample mean \[\overline X_n=\frac1n\sum_{i=1}^nX_i\] and sample standard deviation \[s_n=\sqrt{\frac{1}{n-1}\sum_{i=1}^n(X_i-\overline X_n)^2}\] Then \[\frac{\overline X_n-\mu}{\frac{s_n}{\sqrt n}}\sim T_{n-1}\] where \(T_{n-1}\) follows the Student’s \(t\)-distribution with \(n-1\) degrees of freedom.

In the following simulation, observe:

  1. Decreasing Nsims will make the histogram more jagged, and increasing it will make the histogram fit the \(t\)-distribution better.
  2. Increasing \(n\) will make the \(t\)-distribution and the histogram get closer to the normal pdf.

Try \(n=2,3,5,8,10,15,20,25,30\) and observe what happens.

R code for simulating normal data and showing that the sample mean, when standardized with the sample standard deviation, follows the t-distribution:
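
A minimal sketch in base R (mu = 0, sigma = 1, and the starting n = 5 are illustrative assumptions):

    mu    <- 0      # assumed value
    sigma <- 1      # assumed value
    n     <- 5      # try n = 2, 3, 5, 8, 10, 15, 20, 25, 30
    Nsims <- 10^4

    # For each sample, standardize with the sample sd s instead of sigma
    ts <- replicate(Nsims, {
      x <- rnorm(n, mean = mu, sd = sigma)
      (mean(x) - mu) / (sd(x) / sqrt(n))
    })

    # The t-distribution has heavy tails for small n, so limit the x-axis
    hist(ts, breaks = 200, freq = FALSE, xlim = c(-5, 5),
         main = "Sample means standardized with s",
         xlab = "T")

    # Overlay the t-distribution with n - 1 degrees of freedom ...
    curve(dt(x, df = n - 1), add = TRUE, col = "red", lwd = 2)

    # ... and the standard normal pdf for comparison
    curve(dnorm(x), add = TRUE, col = "blue", lwd = 2, lty = 2)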