I’ve split this post into two parts because it would be really long otherwise. Part 2 will be coming up later, hopefully.
Let’s think about how we might define surprisingness as a mathematical quantity.
Surprisingness is a property of events as perceived by observers. After the event occurs, the observer is surprised to a certain degree. The degree of surprisingness depends on how likely the observer thought it was that the event would occur, or to put it briefly the subjective probability of the event. In fact, to a reasonable approximation, at least, the degree of surprisingness depends only on the subjective probability of the event. The greater the subjective probability, the less the surprise.
For example, if you flip a coin and it comes up heads, this is not very surprising—the probability of this event was only , which is reasonably high. But if you flip ten coins and they all come up heads, you are much more surprised, due to the fact that the probability of ten heads in ten flips is only
This is a much smaller quantity than , hence the large increase in surprise. Even if you are unable to do the calculation and work out that the probability of ten heads in ten flips is exactly , you are probably aware on an intuitive level that it is a lot smaller than .
These considerations suggest surprisingness might be definable as a mapping on (the set of the real numbers between 0 and 1, inclusive), such that for every (i.e. every member of ), the value is the common surprisingness of the events of subjective probability . Listed below are three properties such a mapping should have, if it is to be in accordance with the everyday meaning of “surprise”.
Property 1. For every (i.e. every member of ), the value should be a non-negative real number, or (positive infinity).
Justification. Surprisingness seems to be reasonably well-modelled as (to use statistics jargon) an interval variable: two surprisingness values can be compared to see which is less and which is greater, there is a sensical notion of “distance” between surprisingness values, etc. Therefore, it makes sense for surprisingness to be represented by a real number. Moreover, there is a natural minimum surprisingness, namely the surprisingness of an event of subjective probability 1 (i.e. which the observer was sure would happen)—such an event is not at all surprising, and it makes sense to speak of being, e.g., “twice as surprised” by one event compared to another with reference to this natural minimum. To use statistics jargon again, surprisingness is a ratio variable. Naturally, it makes sense to set the minimum value at 0.
Property 2. The mapping should be strictly decreasing. That is, for every pair of members of such that , we should have .
Justification. As said above, events of high subjective probility are less surprising than those of low subjective probability.
Property 3. For every pair of members of , we should have
Justification. Suppose and are independent events of subjective probabilities and respectively. Then, assuming subjective probabilities are assigned in a consistent way, obeying the laws of probability, the subjective probability of (the event that both and occur) is and therefore the surprisingness of is . But given that and are independent (so the observation of one does not change the observer’s subjective probability of the other), the surprisingness of overall, i.e. , should be the same as the total surprisingness of the individual observations of and , i.e. .
Remarkably, these three properties determine the form of almost exactly (up to a vertical scale factor). Feel free to skip the proof below; it is somewhat long and tedious and doesn’t use any ideas that will be used later on in the post, and I haven’t put in much effort to make it easy to follow.
Theorem 1. The mappings on having properties 1–3 above are those of the form (i.e. those mappings on such that for every ), where is a real number greater than 1.
Proof. Suppose is a mapping on having properties 1–3 above. Let be the composition of and . Then the domain of is the set of the such that , i.e. ; and for every pair of non-positive real numbers, we have
In the case where , we see that , and we also have because , so combining the two we have , from which it follows by subtraction of from both sides that . Similarly, for every non-positive and every non-negative (the set is the set of the integers) we have , and if we assume as an inductive hypothesis that it follows that . This proves by induction that .
Now, let . For every non-positive (the set is the set of the rational numbers), there is a non-positive and a positive such that , and therefore
from which it follows that .
The equation in fact holds for every non-positive , not just for . To prove this, first, observe that is strictly decreasing, because for every pair of non-positive real numbers such that , we have and therefore (the mapping being strictly decreasing by property 2). Second, suppose is a non-positive real number and . Then , i.e. either (case 1) or (case 2). Because every non-empty interval contains a rational number, it follows that there is an such that (in case 1) or (in case 2).
In both cases, we have . This is obvious in case 1, because . As for case 2, observe first that is non-negative-valued and therefore . If is positive, it will then follow that and therefore (because ). To prove that is positive, observe that is strictly decreasing, and we have , i.e. , and , i.e. .
Because in both cases, we have . And by the strict decreasingness of it follows that either (in case 1) or (in case 2). Rearranging these two inequalities gives us (in case 1) or (in case 2). But we already have (in case 1) or (in case 2), so in both cases there is a contradiction. It follows that cannot hold; we must have .
Finishing off, note that for every , we have . Let ; then because , and , i.e. , from which it follows that for every . From this we can conclude that for every mapping on such that properties 1–3 hold, we have
where is a real number greater than 1.
In the rest of this post, suppose is a real number greater than 1. The surprisingness mapping will be defined by
It doesn’t really make any essential difference to the nature of which value of is chosen, because for every , we have (where the symbol with no base specified refers to the natural logarithm, the one to base ), and therefore the choice of base amounts to a choice of vertical scaling. Below are three plots of surprisingness against probability for three different bases (2, and 10); you can see that the shape of the graph is the same for each base, but the vertical scaling is different.
Note that regardless of the choice of base, we have
That is, events thought sure to occur are entirely unsurprising, and events thought sure not to occur are infinitely surprising when they do happen.
I’m inclined to say on this basis that 2 is the most sensible choice of base, because of all the possible probabilities other than 0 and 1, the one which is most special and thus most deserving of corresponding to the unit surprisingness is probably , the probability exactly halfway between 0 and 1. However, base is also quite convenient because it simplifies some of the formulae we’ll see later on in this post.
First, two basic definitions from probability theory (for readers who may not be familiar with them). Suppose is a finite set.
Definition 1. The probability distributions on are the positive real-valued mappings on such that
The probability distributions on can be thought of as random generators of members of . Each member of is generated with probability equal to the positive real number less than or equal to 1 associated with that member. The probabilities of generation for each member have to add up to 1, and this is expressed by equation (1). Note that the requirement that the probabilities add up to 1 implies the requirement that the probabilities are each less than or equal to 1, so there is no need to explicitly state the latter requirement in the definition.
Henceforth, suppose is a probability distribution on .
Definition 2. For every real-valued mapping on , the expected value or mean of the transformation of by is given by the formula
Let , , … and be the members of and let , , … and . Then for every real-valued mapping on , the expected value is the weighted average of , , … and , with the weights the respective probabilities , , … and .
If a very large number of values are generated (independently) by , then the average value under of these values can be calculated as
where , , … and are the frequencies of the values , , … and in the sample. Given that is very large it is very likely that the relative frequencies , , … and will be close to , , … and , respectively, and therefore the average will be close to .
Now, the definition of the key concept of this post:
Definition 3. The expected surprisingness or entropy (that’s the Greek capital letter eta, not the Latin capital letter H) of is (here, is the composition of and , i.e. ).
The use of the word “entropy” here is closely related to the use of the same word in thermodynamics. However, I won’t go into the connection here (I don’t really know enough about thermodynamics to be able to talk intelligently about it).
Having the concept of entropy to hand gives us a fun new question we can ask about any given probability distribution: what is its entropy, or, more evocatively, how surprising is it?
By (2), we have
Using the definition of , this can also be written as
Using logarithm properties, it can even be written as
This shows that is the logarithm of the reciprocal of
which is the geometric weighted average of , , … and , with the weights being identical to the values averaged. A geometric weighted average is similar to an ordinary weighted average, except that the values averaged are multiplied rather than added and the weights are exponents rather than factors. The product (5) can therefore be thought of as the “expected probability” of the value generated by . To the extent that is likely to generate one of the members of which has a low individual probability of being generated, the product (5) is small and accordingly is large. This may help give you a slightly more concrete idea why can be thought of as the expected surprisingness of .
Note that the value of is determined completely by the probabilities , , … and . The values , , … and are irrelevant in and of themselves. This is evident from the formula (3), in which the expression only appears as a sub-expression of . To understand how is determined by these probabilities, it helps to look at the graph of the mapping , which I have plotted below. The entropy is a sum of exactly values of this mapping.
It can be seen from the graph that the mapping has a maximum value. Using calculus, we can figure out what this maximum value is and exactly where it is attained.
Theorem 1. The maximum value attained by is , and this maximum is attained at the point (and nowhere else).
Proof. The derivative of is (by the product rule), which is positive if (because then ), equal to 0 if (because then ), and negative if (because then ). Therefore increases in value from the vicinity of 0 to and decreases in value from towards , which means it attains a minimum at .
Because is a sum of values of , we may conclude that the inequality
always holds. However, this upper bound on the value of can be improved, as we’ll see below. After all, in proving that it holds we’ve made no use of equation (1).
The concept of cross entropy is a useful generalization of the concept of entropy.
Suppose another probability distribution on . Think of and as two different models of the random member-of- generator: is the right model (or a more right model, if you don’t like the idea of one single right model), and is an observer’s working model. The probabilities , , … and can be thought of as the real probabilities of generation of , , … and , respectively, while the probabilities , , … and can be thought of as the observer’s subjective probabilities. Although the real probabilities determine what the observer observes, the subjective probabilities are what determine how surprised the observer is by what they observe. Therefore, if the observer calculates entropy by averaging their surprisingness over a large number of observations, they will get something close to
i.e. . This quantity is called the cross entropy from to and denoted . Note that if then ; that’s why the concept of cross entropy is a generalization of the concept of entropy. Note also that is not necessarily the same as .
Your intuition should tell you that will always be greater than , if . Why? Think about it: is the expected surprisingness if the observer has the wrong model, is the expected surprisingness if the observer has the right model. Having the right model should lead to the observer being better able to predict which member of is generated and thus to the observer being less likely to be surprised.
It is indeed the case that
if (which further reassures us that is a very good mathematical model of the intuitive notion of surprisingness). The inequality (6) is called Gibbs’ inequality. In order to prove it, first, observe that it may be rewritten as
Now, the quantity on the left-hand side of (7) has a name of its own: it’s called the Kullback-Leibler divergence from to and denoted . It measures the “penalty” in increased surprisingness the observer gets for having the wrong model; it can also be thought of as a measure of how different the probability distributions and are from each other, hence the name “divergence”. As for the “Kullback-Leibler” part, that’s just because mathematicians have come up with lots of different ways of measuring how different two probability distributions are from each other and Kullback and Leibler were the mathematicians who came up with this particular measure. I won’t be referring to any other such measures in this post, however, so whenever I need to refer to Kullback-Leibler divergence again I’ll just refer to it as “divergence”.
So Gibb’s inequality, reformulated, states that the divergence between two unequal probability distributions is always positive. To prove this, it’s helpful to first write out an explicit expression for :
Second, we prove a lemma.
Lemma 1. For every positive , we have , where is the natural logarithm (logarithm to base ) of , with equality if and only if .
Proof. The inequality in question is equivalent to (by multiplication of both sides by ). The right-hand side of this inequality is equal to . Consider the mapping . The derivative of this mapping is , which is negative if (because then and therefore ), equal to 0 if (because then ) and positive if (because then and therefore ). Therefore is strictly decreasing on and strictly increasing on . It follows that its minimum value is attained at the point . And that minimum value is .
Using Lemma 1, we have
for every , with equality if and only if , i.e. . Given that , there is at least one such that and therefore (8) holds without equality. It follows that
which proves Gibbs’ inequality.
Gibbs’ inequality is quite powerful and useful. For example, it can be used to figure out what the maximum possible entropy is. Suppose is such that , regardless of the value of . Then if , we have by Gibbs’ inequality, and therefore is the maximum entropy possible for probability distributions on and is the unique probability distribution on whose entropy is . Is there a probability distribution on such that , regardless of the value of ? There is indeed. Let be the uniform distribution on , i.e. the mapping (remember, is the number of members has). Then
so is a probability distribution, and regardless of the value of we have
Therefore, the maximum entropy possible on is , and the uniform distribution on is the probability distribution which attains this maximum.
A consequence of this is that the divergence of from the uniform distribution on is given by
which is just the negation of plus a constant (well, a constant depending on the size of the set which is distributed over). Therefore, among probability distributions on specifically, entropy can be thought of as a measure of divergence from the uniform distribution. Among probability distributions in general, entropy is a measure of both divergence from the uniform distribution and the size of the distributed-over set.
So far, we’ve seen that the informal concept of surprisingness can be formalized mathematically with quite a high degree of success—one might even say a surprising degree of success, which is fitting—and that’s pretty neat. But there are also some deeper philosophical issues related to all this. I’ve avoided talking about them up till now because philosophy is not really my field, but I couldn’t let them go completely unmentioned.
Suppose that you know that one of values , , … and values (where is a positive integer) will be generated by some probability distribution , but you don’t know the probabilities; you have absolutely no information about the probability distribution other than the set of values it may generate. What should your subjective probability distribution be? A possible answer to this question is that you shouldn’t have one—you simply don’t know the probabilities, and that’s all that can be said. And that’s reasonable enough, especially if you don’t like the concept of subjective probability or think it’s incoherent (e.g. if you’re a strict frequentist when it comes to interpreting probability). But if you accept that subjective probabilities are things it makes sense to talk about, well, the whole point of subjective probabilities is that they represent states of knowledge under uncertainty, so there’s no reason in principle to avoid having a subjective probability distribution just because this particular state of knowledge is particularly uncertain. Of course this subjective probability distribution may change as you gather more information—think of it as your “best guess” to start with, to be improved later on.
There are some people who’d say that you can choose your initial “best guess” on an essentially arbitrary basis, according to your own personal whims. No matter which subjective probability distribution you choose to start with, it will get better and better at modelling reality as you change it as you gather more information; the initial state isn’t really important. The initial state is of interest only in that if we have some updating mechanism in mind, we’d like to be able to prove that convergence to the real probability distribution will always happen independently of the initial state.
There is another position which can be taken, however, which is that there is in fact a certain objectively best subjective probability distribution to start with. This position is associated with the marquis de Laplace (1749–1827), who wrote a classic text on probability theory, the Théorie analytique des probabilités (Laplace was also a great contributor to many other fields of mathematics, as mathematicians back then tended to be—they didn’t do specialization back then). In Laplace’s opinion, the correct distribution to start with was the uniform distribution. That is, given that we know nothing about the probabilities, assume they are all the same (after all, we have no reason to believe any one is larger than any other). This principle is called the principle of indifference.
The concept of probability as a whole can be based on the idea of the principle of indifference. The idea would be that on some level, any probability distribution is over a set of interchangeable values and therefore uniform. However, we are often interested only in whether one of a particular class of values is generated (not in which particular member of that class) and the probability of interest in that case is the sum of the probabilities of each of the values in that class, which, because the underlying distribution is uniform, can also be expressed as the ratio of the number of values in the class to the total number of values which may be generated. I don’t know how far this is how Laplace thought of probability; I don’t want to be too hasty to attribute views to a 19th-century author which might be out of context or just incorrect (after all, I can’t read French so all my information about Laplace is second-hand).
It’s not hard to argue with the principle of indifference. It’s quite difficult to think of any reasonable justification for it at all. It was indeed attacked in vehement terms by later probability theorists, such as John Venn (1834–1923) (that’s right, the Venn diagram guy). In modern times, however, it has been championed by the statistical physicist E. T. Jaynes (1922–1998), who also came up with an interesting extension of it.
In the mathematical section of this post, we saw that the uniform distribution over , , … and was the one with the most entropy, i.e. the one which is “most surprising on average”. Therefore, a natural generalization of the principle of indifference would be to say that in any circumstance in which a subjective probability distribution must be assumed in a situation of uncertainty, one should assume whichever distribution has the most entropy while still being compatible with what is known. For example, if it is known what the mean is then the assumed distribution should be the one with the most entropy among all distributions having that specific mean. This is called the principle of maximum entropy or MaxEnt for short.
The MaxEnt principle makes sense intuitively, sort of, in a way that the principle of indifference on its own doesn’t. If you don’t know something about the probability distribution, you should expect to be surprised more often by the values it generates than if you do know that something. It’s still on fairly shaky ground, though, and I don’t know how far it is accepted by people other than Jaynes as a normative principle, as opposed to just one way in which you might choose the subjective probability distribution to start with, in competition with other ways of doing so on the basis of strictly pragmatic considerations (which seems to be how a lot of people in the practical applications side of things view it). In any case it gives us a philosophical motivation for examining the mathematical problem of finding the probability distributions that have the most entropy given particular constraints. The mathematical problem is interesting itself, but the philosophical connection makes it more interesting.
In part 2 of this post, I’m going to describe how the famous normal distribution can be characterized as a maximum entropy distribution: namely, if it’s the normal distribution with mean (a real number) and standard deviation (a positive real number), then it’s the absolutely probability distribution over with the most entropy among all absolutely continuous probability distributions over with mean and standard deviation . That roughly means that by MaxEnt, if you know nothing about a continuous probability distribution other than that it has a particular mean and a particular standard deviation, your best guess to start with is that it’s normal. You can understand the famous Central Limit Theorem from that perspective: as you add up independent, identically distributed random variables, the mean and variance of the distribution in question will be carried over into the sum (elementary statistical theory tells us that the mean of any sum of random variables is the sum of the individual means, and the variance of any sum of independent random variables is the sum of the individual variances), but every other distinctive property of the distribution is gradually kneaded out by the summing process, so that as the sum gets larger and larger all we can say about the probability distribution of the sum is that it has this particular mean and this particular variance. I intend to finish off part 2 with a proof of the Central Limit theorem from this perspective, although that might be a little ambitious. Before that, though, the other big thing which I need to cover in part 2 is defining entropy in the case of a continuous probability distribution—I’ve only been talking about discrete probability distributions in part 1, and it turns out the extension is not entirely a straightforward matter.