An Introduction to the Central Limit Theorem (atomicobject.com)
97 points by StylifyYourBlog on Feb 12, 2015 | 29 comments


> Often referred to as the cornerstone of statistics

Well... often referred to as the central theorem of statistics, every time you say its name: what's central is the theorem, not the limit. It was Pólya who first called it that, "zentraler Grenzwertsatz".

> Why the Central Limit Theorem Works

Well... I don't think that's really an explanation at all of why e^(-x^2/2) is such a privileged function. Why would any distribution converge to a normal distribution?

It essentially boils down to the Fourier transform. When you take the Fourier transform (characteristic function) of the distribution of the sample means, if you ignore all but the quadratic terms (there are no linear terms if you standardise to mean 0 and variance 1), you get (1 - t^2/2n)^n, whose limit is e^(-t^2/2). That's the Gaussian, which is its own Fourier transform.

https://en.wikipedia.org/wiki/Central_limit_theorem#Proof_of...

In other words, because the Gaussian is its own Fourier transform, sample means converge to the Gaussian.
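That convergence is easy to see numerically. Here's a quick sketch (assuming NumPy; the uniform distribution is just one arbitrary choice of starting distribution) comparing the empirical characteristic function of standardized sample means against e^(-t^2/2):

```python
import numpy as np

rng = np.random.default_rng(0)

# Standardized sample mean of n iid Uniform(0,1) draws:
# subtract the mean (1/2), divide by the std of the mean (sqrt(1/(12n))).
n, trials = 30, 100_000
x = rng.uniform(size=(trials, n))
z = (x.mean(axis=1) - 0.5) / np.sqrt(1.0 / (12 * n))

# Empirical characteristic function E[exp(itZ)] vs the Gaussian's e^(-t^2/2).
t = np.linspace(-3.0, 3.0, 13)
ecf = np.array([np.exp(1j * ti * z).mean() for ti in t])
gauss = np.exp(-t**2 / 2)
print(np.max(np.abs(ecf - gauss)))  # small: the two nearly coincide
```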


This is a very nice comment, because it points out that the Gaussian "answer" is hidden in the structure of the question.

You know the zero'th moment (the distribution has unit mass), and you have removed the first moment (by subtracting the mean), so the next is the second. By passing to a limit, you remove the higher moments from consideration, and isolate on the second-order behavior of the solution near the origin (in characteristic function coordinates). Only the second-order behavior matters, and the Gaussian is the distribution that has a characteristic function with a quadratic behavior at the origin (in the exponent).
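A compact way to write that down, for a standardized variable X (mean 0, variance 1) with characteristic function φ:

```latex
\varphi(t) = \mathbb{E}\,e^{itX}
           = 1 + it\,\mathbb{E}X - \tfrac{t^2}{2}\,\mathbb{E}X^2 + o(t^2)
           = 1 - \tfrac{t^2}{2} + o(t^2),
\qquad
\varphi\!\left(\tfrac{t}{\sqrt{n}}\right)^{\!n}
  = \left(1 - \tfrac{t^2}{2n} + o\!\left(\tfrac{1}{n}\right)\right)^{\!n}
  \longrightarrow e^{-t^2/2}.
```

Only the quadratic term survives the limit, which is exactly the "second-order behavior near the origin" above.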

The real structure of the problem is that it assumes independence, or a similar condition (like mixing) strong enough to mimic independence. Independence is what allows you to pass to the limit and thereby focus on the second-order behavior around the origin.

And in practice, you have just swept the bump under the rug, because you now have to justify your independence assumption -- and that's hard.

My quibble is with the side comment about the Gaussian being its own Fourier transform. This property is required (i.e., the Gaussian must be a fixed point of the CLT), but the property does not provide any constraint on the class of possible solutions of the CLT.

In fact, there are infinitely many functions that are their own Fourier transform. I can easily construct a new one from any Fourier-transformable function you give me. For example, I can make such a function that looks like a square wave with some wiggles in it. (Reference: http://www.systems.caltech.edu/dsp/ppv/papers/journal08post/...)


Why is it important for the existence of a CLT type result that the Gaussian be a fixed point of the Fourier transform? Regardless of whether it were or not, we would find that characteristic functions of (suitably scaled) sample means converged to e^(-x^2/2), and thus the Fourier transform of e^(-x^2/2) would be their limiting distribution. That this Fourier transform is e^(-x^2/2) again does not seem essential.


Hmm, you know, I think you have a point. Clearly any CLT solution must be a fixed point of convolution (as the Gaussian is).

The other property ("Gaussian is its own Fourier transform") seemed important, but I'm unable to formalize the argument, and don't think that "fixed point of convolution" requires it. Thanks for keeping me honest.

Unless I'm missing something, your observation would render the closing statement in the GP comment ("because the Gaussian is its own Fourier transform") completely irrelevant.


> Well... I don't think that's really an explanation at all of why e^(-x^2) is such a privileged function. Why would any distribution converge to a normal distribution?

> It essentially boils down to the Fourier transform

Precisely.

One motivation for taking a Fourier transform is that the p.d.f. of the sum of several independent random variables is the convolution of the corresponding p.d.f.s, and that convolution in the original domain is equivalent to multiplication in the Fourier domain (by the convolution theorem).


...and p.d.f.s convolve when you derive the distribution of a random variable defined as the sum of other independent random variables.
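The convolution fact is easy to check numerically (a sketch, assuming NumPy; the uniform density is just a convenient example): convolving the U(0,1) density with itself recovers the triangular density of the sum of two uniforms.

```python
import numpy as np

# Discretize the Uniform(0,1) density on a grid; the density of the sum of
# two independent copies is the convolution of the densities (Riemann sum).
dx = 0.001
f = np.ones(1000)              # U(0,1) density sampled on [0, 1)
g = np.convolve(f, f) * dx     # density of U+U, supported on [0, 2)

# The exact answer is the triangular density, peaking at g(1) = 1.
print(g.max())
```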


Another is that the Fourier integral of rand(x), strictly speaking, doesn't converge at the endpoints, although a generalized Fourier transform exists.


It doesn't matter that the Gaussian is its own Fourier transform, as such, does it? It's just that, whatever it is that Fourier transforms into the Gaussian (1 - t^2/2n)^n, that's what sample means converge to.

[I suppose it does matter that this is itself Gaussian for establishing the central limit theorem as specifically about Gaussian distributions. I guess my point was that, even if, counterfactually, nothing Fourier transformed into itself, still we would have a central limit theorem about whatever distribution it was that Fourier transformed into the Gaussian]


GP noted why it matters for an actual explanation:

> Well... I don't think that's really an explanation at all of why e^(-x^2/2) is such a privileged function. Why would any distribution converge to a normal distribution?

The point is, nothing in maths is arbitrary (except axioms); if you see something singled out from the space of many alternatives, there must be a good reason why it's exactly this thing, hidden in the structure of the problem. There's this rule of thumb, also important in programming, that there are generally three amounts: zero, one, or arbitrarily many.


I don't think you've caught what I was saying. The GP noted indeed why we should expect the distribution of suitably scaled sample means to converge to the Fourier transform of e^(-x^2/2). I'm saying, it doesn't seem essential to the existence of a CLT type result that this Fourier transform is e^(-x^2/2) itself [i.e., it doesn't seem essential that e^(-x^2/2) happens to be a fixed point of the Fourier transform]; if the Fourier transform of e^(-x^2/2) were some other function, we'd still speak of a Central Limit Theorem and the privileged role of that other function instead. The Fourier transform fixed point phenomenon seems orthogonal to the distribution convergence phenomenon.


Ok, I guess I didn't catch it. Thanks for the clarification.


Note though that e^(-x^2/2), while privileged, is not entirely privileged, because there are other functions with a central limit property (e.g. 1/(1+x^2), the Cauchy density), and that one is not its own Fourier transform.


I always liked this visual representation of the central limit theorem: http://blog.vctr.me/posts/central-limit-theorem.html. There is a faster one here (I think written in R): http://vis.supstat.com/2013/04/bean-machine/

These are computer simulations of Galton boxes: http://en.wikipedia.org/wiki/Bean_machine
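A Galton box is a few lines to simulate yourself (a sketch, assuming NumPy): each ball's final bin is just a Binomial(rows, 1/2) draw, and the histogram of bins comes out bell-shaped.

```python
import numpy as np

rng = np.random.default_rng(1)

# Each ball bounces right (1) or left (0) at each of `rows` pegs; its final
# bin is the number of rightward bounces, i.e. a Binomial(rows, 1/2) draw.
rows, balls = 12, 100_000
final_bin = rng.integers(0, 2, size=(balls, rows)).sum(axis=1)

counts = np.bincount(final_bin, minlength=rows + 1)
print(counts)  # bell-shaped, peaking near bin rows/2 = 6
```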


As the comments to that page mention, though, it's not a representation of the central limit theorem at all. It shows how a binomial distribution "becomes" a normal distribution in the long run, but it doesn't show why this would apply to the distribution of errors of a series of sample means.


I have fond childhood memories of the "Mathematica" exhibit at the Museum of Science in Boston, where there was a large physical copy of one of these machines.

Also, "quincunx" is a great word.


I think the first half of the article showing how this works with a given sample distribution is pretty good. I don't think it's really doing much to build intuition at the end, though.

It's also worth pointing out that there are distributions for which the central limit theorem doesn't hold (e.g. the sum of samples from a Lorentzian distribution will again be Lorentzian, not Gaussian.)
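A quick simulation makes the failure vivid (a sketch, assuming NumPy): sample means of standard Cauchy draws are themselves standard Cauchy, so averaging never concentrates the distribution, no matter how large the sample.

```python
import numpy as np

rng = np.random.default_rng(2)

# For standard Cauchy draws, the mean of n values is again standard Cauchy,
# so P(|sample mean| > 1) stays at 1/2 regardless of n.
trials = 20_000
for n in (10, 1000):
    means = rng.standard_cauchy(size=(trials, n)).mean(axis=1)
    print(n, (np.abs(means) > 1).mean())  # ~0.5 for both sample sizes
```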


I think the more typical case when the CLT is applied erroneously is not heavy tails, it's cases where the independence condition does not hold.

Independence of an unbounded sequence of variables is a very strong assumption, but so easy to hide away with the magic letters, "iid".


> the sum of samples from a Lorentzian distribution will again be Lorentzian, not Gaussian

Lorentzian? I had to look that up. Oh. Cauchy distribution. Right, because it doesn't have any finite moments, because the tails are too heavy.


Physicists like to say Lorentzian (or sometimes Breit-Wigner) instead of Cauchy.


I have a series of basic questions I include in any data science interview, and one is "please describe what the central limit theorem says in simple, high-level terms". It's absolutely amazing how many people who have great credentials can't do this. I get a lot of "any distribution becomes normal when you sample it enough". This is nonsensical and shows a lack of understanding of the theorem.

Please, if you claim to know stats, understand what the central limit theorem says. It's a pretty incredible and useful theorem.


Out of curiosity, mind sharing some of those questions? I'm a computer science major looking into getting a data science internship, and I'd love for a chance to see some real interview questions.


Yep! Here are a few, I customize the questions a bit based on seniority and background.

1. In a 1-d gaussian with mean=0 and variance=1, what is p(x=0)?

2. If I give you a csv file with a header, how would you return a tsv file with columns 6 and 2? Just high-level, not actual implementation.

3. If I have some dataset (I change it up a lot, a common one is subway ridership by station and date), how would you create a visualization to give some insight to a fairly non-technical audience, e.g. the distribution of ridership by station.

4. If I gave you a house dataset with a bunch of attributes (e.g. year built, number of sq feet, presence of pool, zip code), how would you go about building a machine learning model that predicts whether the house is being rented or is owned (assuming you are provided with tagged data)? The selling price? How many times it has been sold?

5. What's a technology that you're excited about (could be a computing tool, a machine learning algorithm in a library, etc)? Tell me some of the pros and cons of it and competing technologies.

We also give out a 4-6 hour task as homework, and evaluate the code quality, tools used, answers provided, etc.
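For question 2, one plausible high-level answer is the standard-library csv module; here's a sketch on a made-up input (the file contents and column names are hypothetical):

```python
import csv
import io

# Hypothetical CSV with a header row and at least six columns.
csv_text = "a,b,c,d,e,f\n1,2,3,4,5,6\n7,8,9,10,11,12\n"

out = io.StringIO()
writer = csv.writer(out, delimiter="\t")
for row in csv.reader(io.StringIO(csv_text)):
    writer.writerow([row[5], row[1]])   # columns 6 and 2 (1-indexed)

print(out.getvalue())
```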


My introduction to the central limit theorem was that chained independent random processes tend to result in a Gaussian distribution. This is so general that one is surprised when one finds non-Gaussian distributions (canonical example: the stock market).

I attended a lecture by Mandelbrot (shortly before he died) where he spoke at length about this- take a look at stable distributions and the generalized central limit theorem.


I think that Mandelbrot's point is actually quite the opposite: that non-Gaussian stable distributions are the norm, and that we are trained to see Gaussians, which is why we're surprised. When alpha is very close to 2 it really makes little difference (e.g. 1.9 and above), but even a bit lower, like 1.8, can have some very scary consequences for modelling. In his book he has an example of an insurance market and modelling the failure of insurance companies... which was especially poignant considering what happened after he died (and what is happening now).


Is it not tradition to use n for sample size rather than N? N is typically population size (in my experience).


I don't think any tradition is too entrenched on this particular point.


You might be right. In my stats courses, we used lower case letters for all sample estimates and upper case for all population values.


To me... the core idea is that (given one chooses, over and over, from a bunch of independent and identically distributed events):

There are more ways for everything to happen than there are ways for one thing to happen over and over.
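That counting intuition can be made concrete with binomial coefficients (a small sketch using Python's math.comb): among length-20 coin-flip sequences, balanced outcomes vastly outnumber the extremes.

```python
from math import comb

# Number of length-20 sequences with exactly k heads, for a few values of k:
# the "mixed" middle dwarfs the all-tails extreme.
n = 20
print(comb(n, 0), comb(n, 5), comb(n, 10))  # 1, 15504, 184756
```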


An interesting bit of trivia for computer history buffs:

Alan Turing independently discovered the Central Limit Theorem while still an undergrad in 1934.



