The article mentions the fairly familiar fact that "median is to L1 as mean is to L2" -- i.e., the mean is the point that minimizes the total squared error, and the median is a point that minimizes the total absolute error.
I was telling my 8-year-old daughter about that the other day, and it occurred to me that the other "measure of central tendency" children get taught about, the mode, also fits into this scheme, with a tiny bit of fudging. Define the L0 error to be the sum of the 0th powers of the errors, with the (unusual) convention that 0^0=0. In other words, the L0 error is just the number of data points you don't get exactly right. Then the mode is the value that minimizes the L0 error. So: L0 : L1 : L2 :: mode : median : mean.
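To make that concrete, here is a quick brute-force check in Python (just a sketch; the toy data and the grid of candidate centres are mine):

    import numpy as np

    data = np.array([2, 2, 3, 4, 5, 6, 7])

    def lp_error(c, p):
        # Sum of p-th powers of absolute errors, with the 0^0 = 0 convention.
        errs = np.abs(data - c)
        if p == 0:
            return np.count_nonzero(errs)  # number of points not hit exactly
        return float(np.sum(errs ** p))

    # Candidates: the data values themselves (where the L0 and L1 minimizers
    # live) plus a fine grid for the L2 case.
    candidates = np.union1d(data.astype(float), np.linspace(2, 7, 5001))
    for p in (0, 1, 2):
        best = min(candidates, key=lambda c: lp_error(c, p))
        print(f"L{p} minimizer: {best:.3f}")
    # L0 -> 2.000 (the mode), L1 -> 4.000 (the median), L2 -> 4.143 (the mean, 29/7)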
(Note 1. The obvious other Lp norm to consider is p=infinity, in which case you're minimizing the maximum error. That gives you the midrange: the average of the min and max values in your dataset. Not useful all that often.)
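The same brute-force check works for the p=infinity case (again a sketch on my toy data):

    import numpy as np

    data = np.array([2, 2, 3, 4, 5, 6, 7])
    candidates = np.linspace(2, 7, 5001)
    best = min(candidates, key=lambda c: np.max(np.abs(data - c)))
    print(f"{best:.3f}")                  # 4.500
    print((data.min() + data.max()) / 2)  # midrange: (2 + 7) / 2 = 4.5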
(Note 2. What about the rest of the "L0 column" of Ben's table? Your measure of spread is the number of values not equal to the mode. Your implied probability distribution is an improper one where there's some nonzero probability of getting the modal value, and all other values are "equally unlikely". (Again: not actually possible; there is no uniform distribution on the whole real line.) Your regression technique is one a little like RANSAC, where you try to fit as many points exactly as you can, and all nonzero errors are equally bad. I doubt there's any analogue of PCA, but I haven't thought about it. Your regularized-regression technique is "best subset selection", where you simply penalize for the number of nonzero coefficients.)
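As a toy illustration of that last entry, brute-force best subset selection might look like this (a sketch with made-up data; lam, the per-coefficient penalty, is an arbitrary choice, and trying all subsets is only feasible for a handful of features):

    import itertools
    import numpy as np

    rng = np.random.default_rng(0)
    n, d = 50, 6
    X = rng.normal(size=(n, d))
    true_coef = np.array([3.0, 0.0, 0.0, -2.0, 0.0, 0.0])  # only two real features
    y = X @ true_coef + 0.1 * rng.normal(size=n)

    lam = 1.0  # price paid per nonzero coefficient -- the L0 penalty
    best_score, best_subset = np.inf, ()
    for k in range(d + 1):
        for subset in itertools.combinations(range(d), k):
            cols = list(subset)
            if cols:
                coef, *_ = np.linalg.lstsq(X[:, cols], y, rcond=None)
                rss = float(np.sum((y - X[:, cols] @ coef) ** 2))
            else:
                rss = float(np.sum(y ** 2))
            score = rss + lam * k
            if score < best_score:
                best_score, best_subset = score, subset

    print(best_subset)  # (0, 3): exactly the truly nonzero coefficients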
Define it as the limit as p goes to 0 from above of the sum of |actual - expected|^p, i.e. the limit of the Lp error. No need to muck about with zero to the zeroth power directly.
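That limit is easy to see numerically (a quick sketch; the error values are arbitrary): each nonzero error contributes almost exactly 1 once p is small, while exact hits contribute 0.

    import numpy as np

    errs = np.abs(np.array([0.0, 0.0, 0.5, 2.0, 37.0]))  # two exact hits, three misses
    for p in (1.0, 0.1, 0.01, 0.001):
        print(p, float(np.sum(errs ** p)))
    # The sums tend to 3, the number of nonzero errors -- i.e. the L0 error.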
The L0 and L1 norms can have surprising benefits under certain assumptions. If you know (or assume) your signal is sparse (i.e. mostly zero) in some domain, then minimizing the L0 norm means minimizing the number of nonzero terms.... But that's computationally hard, so we use the L1 norm instead, which under suitable conditions still guarantees exact recovery even when sampling below the Nyquist rate. Look up the 2004 paper by Candes and Tao for more info (titled "Exact recovery ... undersampled ...")
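For the curious, here is a minimal basis-pursuit sketch of that idea using scipy's LP solver. The problem sizes and the Gaussian measurement matrix are illustrative assumptions on my part, not anything from the paper:

    import numpy as np
    from scipy.optimize import linprog

    rng = np.random.default_rng(1)
    n, m, k = 60, 30, 4                    # signal length, measurements, sparsity
    x_true = np.zeros(n)
    x_true[rng.choice(n, size=k, replace=False)] = rng.normal(size=k)
    A = rng.normal(size=(m, n))            # random measurement matrix, m < n
    b = A @ x_true

    # Basis pursuit: minimize ||x||_1 subject to Ax = b.
    # Cast as a linear program via x = u - v with u, v >= 0.
    c = np.ones(2 * n)
    res = linprog(c, A_eq=np.hstack([A, -A]), b_eq=b, bounds=(0, None))
    x_hat = res.x[:n] - res.x[n:]

    print(np.allclose(x_hat, x_true, atol=1e-6))  # True: exact recovery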
Wow, despite studying probability for years I never quite realised that the mean minimises the L2 error. This is despite knowing full well the essentially equivalent fact that the mean is the orthogonal projection in L2!
The median minimising L1 is also very nice, and I did not know that. It also means that the concept of median generalises easily to higher dimensions.
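In higher dimensions this becomes the geometric median, the point minimizing the sum of Euclidean distances. It has no closed form, but Weiszfeld's fixed-point iteration computes it; a minimal sketch (the toy points are mine):

    import numpy as np

    def geometric_median(points, iters=100):
        # Weiszfeld's algorithm: repeatedly take the mean weighted by 1/distance.
        y = points.mean(axis=0)             # start from the ordinary mean
        for _ in range(iters):
            d = np.linalg.norm(points - y, axis=1)
            d = np.maximum(d, 1e-12)        # avoid dividing by zero at a data point
            w = 1.0 / d
            y = (points * w[:, None]).sum(axis=0) / w.sum()
        return y

    pts = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [10.0, 10.0]])
    print(pts.mean(axis=0))        # [2.75 2.75] -- dragged toward the outlier
    print(geometric_median(pts))   # stays close to the cluster near the origin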