when i first learned induction, the teacher had us prove \sum_{ k = 1 }^n k = \binom{ n + 1 }{ 2 } . she gave us the famous story of little gauss using his pairing trick to bypass the summing exercise.

our teacher wanted to make sure we really understood induction, so she gave us a harder identity to prove:

\sum_{ k = 1 }^{ n - 1 }  k^2  = ( n^3 / 3 ) - ( n^2 / 2 ) + ( n / 6 )

no big deal: you plug and chug and the identity is proved.

proofs by induction, at least as taught this way, didn't do much by way of motivation: they are used to verify rather than to deduce.

i wanted to understand how someone could come up with the identity in the first place. i wanted machinery to derive those types of identities in general. so i asked our teacher how gauss came up with this identity.

the teacher's response was that gauss was brilliant and that he guessed at the solution (though she was quick to point out that once the solution was found, it was easy to verify by induction). i found this horrifying, as the answer looked like it was drawn from thin air. i imagined gauss as having a direct line to god, some profound understanding of how the universe worked that allowed him to come up with this answer. it was at this moment that i gave up any hope of ever being a mathematician.

years later i found out that this was total crap. no doubt gauss was brilliant, but proofs by induction, especially of these types of identities, can be produced in a very systematic way. no doubt the machinery was either well known in gauss's time or he had something like it at hand. a friend of mine showed me the machinery to generate all identities of this sort. it requires some insight, but a tenable amount: certainly a far cry from the intellectual deification that i was attributing to gauss.

we’ll create two operators, S and D.

S[ x_k ] = \sum_{ k = 0 }^{ n - 1 } x_k

where n is arbitrary but fixed.

D[ x_k ] = x_{ k + 1 } - x_k

in both of these cases, it's just a simple textual substitution: wherever you see the symbols S[ x_k ] or D[ x_k ] , you can replace them appropriately.


composing the two, the sum telescopes:

S[ D[ x_k ] ] = S[ x_{ k + 1 } - x_k ] = \sum_{ k = 0 }^{ n - 1 } ( x_{ k + 1 } - x_k ) = x_n - x_0

using this, we can apply it to our original sum of interest:

S[ D[ k^3 ] ] = n^3 - 0^3 = n^3

but done differently:

S[ D[ k^3 ] ] = S[ ( k + 1 )^3 - k^3 ] = S[ 3 k^2 + 3 k + 1 ] = 3 S[ k^2 ] + 3 S[ k ] + S[ 1 ]

equating the two and after some fiddling:

S[ D[ k^3 ] ] = n^3 = 3 S[ k^2 ] + 3 S[ k ] + S[ 1 ]
\to S[ k^2 ] = ( n^3 - 3 S[ k ] - S[ 1 ] ) / 3
\to S[ k^2 ] = ( n^3 - 3 n ( n - 1 ) / 2 - n ) / 3
\to S[ k^2 ] = ( n^3 / 3 ) - ( n^2 / 2 ) + ( n / 6 )

where we used S[ k ] = n ( n - 1 ) / 2 and S[ 1 ] = n .

giving us the identity we wanted. we could just have assumed that we were looking for some polynomial of degree 3 and solved for each of the coefficients, but that seems a bit unsatisfactory to me. i like this way a lot better.
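the whole derivation is easy to sanity check numerically. a minimal sketch, assuming nothing beyond the definitions of S and D above (the helper names are my own):

```python
# S sums x_k over k = 0 .. n-1; D is the forward difference x_{k+1} - x_k
def S(f, n):
    return sum(f(k) for k in range(n))

def D(f):
    return lambda k: f(k + 1) - f(k)

n = 10
cube = lambda k: k ** 3

# the telescoping identity: S[ D[ x_k ] ] = x_n - x_0
telescoped = S(D(cube), n)          # n^3 - 0^3 = 1000

# the derived identity: S[ k^2 ] = n^3/3 - n^2/2 + n/6
lhs = S(lambda k: k * k, n)
rhs = n ** 3 / 3 - n ** 2 / 2 + n / 6
```

swap in any other power of k for `cube` and the same S[ D[ k^m ] ] trick generates the identity one degree down.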


a french lesson


i always had trouble remembering what surjective and injective meant. it also always annoyed me that people wouldn't just say onto or one-to-one.

now i know why… “sur” means “on” in french, and “in” is a bastardization of “un”, which means “one” in french.

this is well known, but i just came across this the other day and was a little surprised.

say i have an array that holds n values. n - 1 array positions are 0 valued, with only one position that has a 1 value. how many accesses to the array does it take before i have a 1/2 likelihood of finding the 1 valued position?

in other words, how many tries does it take before i have a 1/2 probability of finding the needle in the haystack?

naively, we might think to just randomly query the array until we find the desired entry. the probability of _not_ finding the entry on any given try is ( 1 - 1/n ) , so after k independent queries it is ( 1 - 1/n )^k . to find k, at least roughly, set this equal to 1/2 and solve:

( 1 - 1/n )^k = ( ( 1 - 1/n )^n )^{ k / n } \approx e^{ - k / n } = 1/2
\rightarrow k \approx n \ln(2)
\rightarrow k \propto n

meaning we need to query a constant proportion of the array before we can have a reasonable chance of finding the needle.
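the estimate k \approx n \ln(2) can be checked directly. a minimal sketch (the choice n = 10000 is arbitrary):

```python
import math

n = 10_000
k = n * math.log(2)          # the predicted number of queries
p_miss = (1 - 1 / n) ** k    # probability that all k independent queries miss
# p_miss should sit very close to 1/2
```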

now, instead of just probing randomly, let's probe so that we never query the same location twice. in the above example, we chose completely at random, potentially accessing an array element more than once. this time, we don't do that.

so now, the probability of not finding the needle, without replacement, after k queries:

\prod_{ j = 1 }^k ( 1 -  j / n ) \leq \prod_{ j = 1 }^k e^{ - j / n }
= exp( - \binom{ k + 1 }{ 2 } / n )

and we want to find k when this probability is about 1/2:

exp( - \binom{ k + 1 }{ 2 } / n ) \approx 1/2
\rightarrow k ( k + 1 ) \approx 2 n ln(2)
\rightarrow k \propto \sqrt{n}

meaning, without replacement, we only need to query about the square root of the number of entries that we would if we did it with replacement.

coincidentally, in a probabilistic sense, this is as good as grover's algorithm.


and finishing up…

“why are the exponents of the power law distribution usually between 1 and 3”?

with exponent less than 1, it's not a probability distribution anymore. greater than 3 gives us finite variance, and the central limit theorem applies. only in the range of 1 to 3 is it still a probability distribution with infinite second moment, where the assumptions of the central limit theorem are violated.

if that did it for you, stop reading now. otherwise, below i just work this explanation out in a bit more detail.

for concreteness, we assume p(x) is the probability density function of our random variable X and that p(x) \propto x^{- \alpha} , with 1 < \alpha \leq 3 .

so let's consider the normalization factor. p(x) is a probability density, so we would like it to integrate to 1. so we want to find the normalization factor, R, that will do this. in symbols:

R = \int_{ 1 }^{ \infty } x^{ - \alpha } dx

for \alpha \leq 1 , R \to \infty , which is not very desirable in a probability distribution. so \alpha must be greater than 1 if we want R finite and still have a valid pdf (in which case R = 1 / ( \alpha - 1 ) ).

so why the cutoff at 3? well, as we know by now, finite variance means that the central limit theorem takes over. finite variance occurs when the second moment is finite. in symbols:

E[ X^2 ] = R^{ - 1 } \int_{ 1 }^{ \infty } x^2 x^{ - \alpha } dx

recalling our freshman calculus…

\to E[ X^2 ] \propto x^{3 - \alpha} \rvert_{1}^{\infty}

which we know is finite when 3 - \alpha < 0 .

and to point out the obvious: for 1 < \alpha < 2 we have infinite mean and infinite variance, whereas for 2 < \alpha < 3 we have finite mean but still infinite variance.
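the convergence claims are easy to check numerically. the second moment integral has the closed form \int_1^L x^{2 - \alpha} dx = ( L^{3 - \alpha} - 1 ) / ( 3 - \alpha ) : push the upper limit L out and it settles for \alpha > 3 but blows up for \alpha < 3 . a minimal sketch (the sample exponents 3.5 and 2.5 are arbitrary choices):

```python
def second_moment_integral(alpha, L):
    # \int_1^L x^2 * x^{-alpha} dx = (L^{3-alpha} - 1) / (3 - alpha), alpha != 3
    return (L ** (3 - alpha) - 1) / (3 - alpha)

# alpha = 3.5: the integral converges (to 2) as the cutoff L grows
conv_small = second_moment_integral(3.5, 1e3)
conv_large = second_moment_integral(3.5, 1e9)

# alpha = 2.5: the integral keeps growing, like 2 * sqrt(L)
div_small = second_moment_integral(2.5, 1e3)
div_large = second_moment_integral(2.5, 1e9)
```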

i’m a little hazy on what happens exactly at 2 and there are some technical conditions that allow for \alpha = 3 to still be power law which i won’t claim to have any deep understanding of. see here for more details on those conditions, though no proofs thereof.


continuing ever on…

“why are power laws and fractals mentioned in the same breath”?

vaguely: power laws represent a statistical fractal.

fractals are pretty ubiquitous and popular, so i'm assuming everyone's familiar with the concept. but the images that come to mind when discussing fractals are mandelbrot sets, julia sets, koch curves, sierpinski gaskets, etc. these are all visually stunning and exemplify the scale symmetry nicely, but they all have a feature that's a bit contrived: they're all highly regular.

“nature” is much messier than this. instead of getting an exact copy back again after rescaling, often you get something that’s similar looking, but not exact.

for example, take a graph with a power law degree distribution. maybe this graph was generated by considering the internet web graph. maybe it's a graph generated via preferential attachment. maybe it's just a graph generated by forcing a degree sequence to be power law and connecting vertices together as best as you can.

all of these have a power law degree distribution, but only statistically so. after a suitable rescale (and this operation might not be as trivial as you think) the resulting, smaller graph will still have the “look and feel” of the larger graph. this look and feel is the degree distribution, which is again power law, as in the original graph.

so while the images we might normally associate with the term fractal are rigorously symmetric and regularly structured, the term can also include this much “messier” version, which is the power law.

remember that probability distributions are very general. as a rule of thumb, details of the underlying process wash out and what's left is the main feature; in this case, the power law. so don't be too fooled by the apparent simplicity of the power law. the underlying machinations of the distribution might be quite complex but still give rise to the simple power law. of course, sometimes that's not the case and a simple process gives rise to a simple power law, so it's always best to be careful. see here for some further discussion.

best to mention that this is most definitely my own take on the subject, so please don’t take the above as gospel truth.

it should also be mentioned that there is a multi-fractal concept out there, which i know little about, but which touches on the above subjects. i'm not exactly sure what connection these have to power law distributions (or more specifically levy stable distributions), so if anyone has any suggestions on this, i would like to hear them.

for a video on multi-fractals see here. she obviously only has a tenuous grasp of the subjects she's lecturing on, but the talk is still worthwhile. she completely fumbles all the questions asked of her, so you should see here (which she recommends in the lecture) to fill in the gaps on the differences between power laws/multiplicative processes and log-normal distributions. i've found here a deeper, but not very enlightening, paper on the subject. and of course, see here for why we should care in the first place about the binomial cascades, fractional brownian motion and multi-fractal stuff (hint: finance) mentioned in that video lecture.


last entry we learned about how power laws come about so naturally. let's continue, shall we?

“what does ‘scale-free’ mean and why are power laws free of scale”?

zoom out from a power-law distribution, rescale and you will get the same distribution back.

so, lets back up a bit:

symmetries are pretty basic: they're a way of describing repeated patterns in nature.

take a wallpaper pattern, for example. take it off the wall, shift it by a certain amount and replaster. you’ll find it looks exactly the same. this is translational symmetry.

take a daisy and twirl it around. looks the same. this is rotational symmetry.

look at a tree, then look at one of its branches stuck in the ground. not exact of course, but to a good approximation they look very much alike. the branch being a mini version of the tree. this is scale symmetry.

scale free means that there is a scale symmetry. basically, if you zoom out (or in) of a picture and it looks the same, then it's scale-free. the term scale-free comes from the fact that since you get the same picture back again after you zoom in or out, you can't tell what the characteristic length scale of the picture was, since it didn't have one in the first place.

some examples will help: consider a square on a piece of paper with a certain length to its side. zoom out by a certain factor. measure the sides and you find that the sides have shrunk. no mystery, you zoom out and the square shrinks. this example is pretty loose but you can see that the “system” in this case has the square’s side as its characteristic length scale and it can be determined by measurement, even after our zooming operation.

now do the same for the coastline of britain, a koch curve, or a tree and its branch: you find in each of these cases that it's difficult to know whether you zoomed in or out, since they all look the same after you've done your zooming operation. each of these systems has a scale symmetry, they look the same after rescaling, and are thus scale-free.

now, zoom out of a power law distribution and you'll find the same distribution again. so power laws have a scale symmetry and are thus scale-free.

in symbols:

p(x) \propto x^{-\alpha}
\to p( x/a ) \propto (x/a)^{-\alpha} = a^{\alpha} x^{-\alpha}
\to p(x/a) = a^{\alpha} p(x)

meaning, besides the constant factor in front, rescale by a and you get a power law distribution back out again.

to see why the above condition is exceptional, consider if p(x) \propto e^{-x^2} or p(x) \propto e^{-x} (two very common distributions). in both of these cases, p(x/a) \ne a^{\alpha} p(x) . both of those distributions would be said to have a characteristic length scale and thus wouldn't be scale-free.
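the difference is easy to check numerically: for a power law the ratio p(x/a)/p(x) is the constant a^{\alpha} no matter which x you test, while for an exponential the ratio depends on x. a minimal sketch (\alpha = 2.5 and a = 3 are arbitrary choices):

```python
import math

alpha, a = 2.5, 3.0
power_law = lambda x: x ** (-alpha)
exponential = lambda x: math.exp(-x)

xs = [1.0, 5.0, 40.0]
# for the power law, every ratio equals a**alpha: no preferred scale
pl_ratios = [power_law(x / a) / power_law(x) for x in xs]
# for the exponential, the ratio varies wildly with x: a characteristic scale
ex_ratios = [exponential(x / a) / exponential(x) for x in xs]
```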

power laws show up in all sorts of places. why this is so wasn't clear to me, and i found it difficult to find a simple answer on the web. the questions i had when i was first learning this stuff were: “why are power laws so common in nature”? “what does ‘scale-free’ mean and why are power laws free of scale”? “why are power laws and fractals mentioned in the same breath”? “why are the exponents of the power law distribution usually between 1 and 3”? so i'll be attempting to answer, if only vaguely, these questions. i am absolutely no expert, but i think i have a basic understanding of this stuff which i hope to pass on to you, the gentle reader.

“why are power laws so common in nature”?

to put it vaguely: power laws are the convergent, limiting distributions of a very general class of underlying probability distributions.

i am, of course, bastardizing the term power law. the above sentence should really have “power law” replaced with “levy stable distribution“, but we’re only dealing in broad strokes here.

levy stable distributions are not easily expressed and, in general, do not have a closed form for their distribution function. nevertheless, under certain conditions, which i'll get into below, they are easily understood by considering the probability distribution function very far away from 0. under suitable conditions, the levy stable distribution behaves as a power law, and this is why i don't feel bad about using the terms “power law” and “levy stable” somewhat interchangeably.

this is basically saying, sum a bunch of random variables and consider the resulting probability distribution. doing this will result in a levy stable distribution and, under conditions which i’ll get into, the “tails” of the distribution (x >> 0) will be power law.

now, some of you might be balking at the above paragraph. here’s how to fool a mathematician: ask “whats the limiting distribution of summing n independent identically distributed random variables”? if their answer invokes the central limit theorem or has mention of “gaussian”, “normal” or “bell curve”: success! you’ve fooled a mathematician.

look up the assumptions for the central limit theorem and you will see one of the conditions is that the random variables have finite variance and thus finite second moment. i made no such condition.

now we see the condition that i was alluding to above: infinite second moment.

when you have n independent identically distributed random variables with infinite second moment, their limiting distribution is levy stable with a power law tail. if the random variables had finite second moment then the central limit theorem does apply and the gaussian falls out. perhaps confusingly, the gaussian is a special case of the levy stable distribution and, of course, does not have a power law tail. at the risk of belaboring the point: infinite second moment -> power law, finite second moment -> gaussian, both cases of which are encapsulated by the levy stable distribution.

the “normal” in “normal distribution” is why we can fool mathematicians. i'm only guessing, but i suppose infinite second moment was considered unnatural, giving credence to the argument that under very general, “natural” conditions, you would have a sum of random variables with finite variance. sum enough of these finite-variance random variables and you get what “normally” happens: a gaussian. power laws and levy stable distributions have been known for a while, but only recently has the commonality of these distributions been noticed. ironically, ask the same question of a physicist and you'll most likely not fool them, since power laws come up frequently in their day to day activities.

now, heuristically, infinite second moment makes the distribution “jumpy”, meaning you will see large deviations from the mean (if it has one). toss a bunch of coins and look at the number of heads. do this experiment enough times and the distribution will have a large peak at the mean with any small deviation from the peak tapering off extremely quickly. do the same with a random variable with infinite second moment and the tails will be much, much fatter.
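this jumpiness is easy to see in simulation. a minimal sketch using inverse-transform sampling: if U is uniform on (0,1), then U^{-1/\alpha} has the pareto tail P(X > x) = x^{-\alpha} . the seed, sample size, and exponent \alpha = 1.5 (infinite variance) are all arbitrary choices:

```python
import random

random.seed(0)
N = 100_000

# inverse transform: U^(-1/alpha) has tail P(X > x) = x^(-alpha), alpha = 1.5
pareto = [random.random() ** (-1 / 1.5) for _ in range(N)]
uniform = [random.random() for _ in range(N)]

# fraction of the total sum carried by the single largest draw
share_pareto = max(pareto) / sum(pareto)     # one draw dominates noticeably
share_uniform = max(uniform) / sum(uniform)  # the largest draw is negligible
```

the heavy tail shows up as a single sample carrying a visible fraction of the entire sum, something that essentially never happens with a bounded distribution.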

perhaps this is why power laws show up so much: the condition of finite second moment is unnatural. at least for some commonly observed phenomena, infinite second moment is quite common, leading to power law distributions.

i have focused on summing random variables, but there are also many other common ways to generate power law distributed processes that involve multiplication of random variables rather than addition. for example, flip a (perhaps weighted) coin n times, but instead of adding 1 for heads, 0 for tails, multiply 2 for heads or 1/2 for tails. do this many times and you get a power law. another one: consider a random variable with exponentially decreasing probability (at a certain rate) with an exponentially increasing value (at a certain, different, rate), this also leads to a power law.

see here for some common ways to generate multiplicative power laws “naturally”, which is where i got the latter example. see here for why the multiplicative coins aren't log-normally distributed.

i have not found a good resource, dead-tree or online, that gives a proof (let alone an understandable one) of why infinite varianced random variables gives a levy-stable distribution. nor have i found a good text for levy stable distributions in general. any pointers or references would be much appreciated.