# The mathematics of surprise, part 1

I’ve split this post into two parts because it would be really long otherwise. Part 2 will be coming up later, hopefully.

### Surprisingness

Let’s think about how we might define surprisingness as a mathematical quantity.

Surprisingness is a property of events as perceived by observers. After the event occurs, the observer is surprised to a certain degree. The degree of surprisingness depends on how likely the observer thought it was that the event would occur, or, to put it briefly, on the subjective probability of the event. In fact, to a reasonable approximation at least, the degree of surprisingness depends only on the subjective probability of the event. The greater the subjective probability, the less the surprise.

For example, if you flip a coin and it comes up heads, this is not very surprising, because the probability of this event was $1/2$, which is reasonably high. But if you flip ten coins and they all come up heads, you are much more surprised, because the probability of ten heads in ten flips is only

$\displaystyle \left( \frac 1 2 \right)^{10} = \frac 1 {1024}.$

This is a much smaller quantity than $1/2$, hence the large increase in surprise. Even if you are unable to do the calculation and work out that the probability of ten heads in ten flips is exactly $1/1024$, you are probably aware on an intuitive level that it is a lot smaller than $1/2$.

These considerations suggest surprisingness might be definable as a mapping $S$ on $[0, 1]$ (the set of the real numbers between 0 and 1, inclusive), such that for every $p \in [0, 1]$ (i.e. every member $p$ of $[0, 1]$), the value $S(p)$ is the common surprisingness of the events of subjective probability $p$. Listed below are three properties such a mapping $S$ should have, if it is to be in accordance with the everyday meaning of “surprise”.

Property 1. For every $p \in [0, 1]$, the value $S(p)$ should be a non-negative real number, or $+\infty$ (positive infinity).

Justification. Surprisingness seems to be reasonably well-modelled as (to use statistics jargon) an interval variable: two surprisingness values can be compared to see which is less and which is greater, there is a sensible notion of “distance” between surprisingness values, etc. Therefore, it makes sense for surprisingness to be represented by a real number. Moreover, there is a natural minimum surprisingness, namely the surprisingness of an event of subjective probability 1 (i.e. one which the observer was sure would happen)—such an event is not at all surprising, and it makes sense to speak of being, e.g., “twice as surprised” by one event compared to another with reference to this natural minimum. To use statistics jargon again, surprisingness is a ratio variable. Naturally, it makes sense to set the minimum value at 0.

Property 2. The mapping $S$ should be strictly decreasing. That is, for every pair $p, q$ of members of $[0, 1]$ such that $p < q$, we should have $S(q) < S(p)$.

Justification. As noted above, events of high subjective probability are less surprising than those of low subjective probability.

Property 3. For every pair $p, q$ of members of $[0, 1]$, we should have $S(pq) = S(p) + S(q).$

Justification. Suppose $A$ and $B$ are independent events of subjective probabilities $p$ and $q$ respectively. Then, assuming subjective probabilities are assigned in a consistent way, obeying the laws of probability, the subjective probability of $A \cap B$ (the event that both $A$ and $B$ occur) is $pq$ and therefore the surprisingness of $A \cap B$ is $S(pq)$. But given that $A$ and $B$ are independent (so the observation of one does not change the observer’s subjective probability of the other), the surprisingness of $A \cap B$ overall, i.e. $S(pq)$, should be the same as the total surprisingness of the individual observations of $A$ and $B$, i.e. $S(p) + S(q)$.

Remarkably, these three properties determine the form of $S$ almost exactly (up to a vertical scale factor). Feel free to skip the proof below; it is somewhat long and tedious and doesn’t use any ideas that will be used later on in the post, and I haven’t put in much effort to make it easy to follow.

Theorem 1. The mappings $S$ on $[0, 1]$ having properties 1–3 above are those of the form $p : [0, 1] \mapsto -\log_b p$ (i.e. those mappings $S$ on $[0, 1]$ such that $S(p) = - \log_b p$ for every $p \in [0, 1]$), where $b$ is a real number greater than 1.

Proof. Suppose $S$ is a mapping on $[0, 1]$ having properties 1–3 above. Let $f$ be the composition of $S$ and $x : \mathbb R \mapsto e^x$. Then the domain of $f$ is the set of the $x \in \mathbb R$ such that $0 \le e^x \le 1$, i.e. $x \le 0$; and for every pair $x, y$ of non-positive real numbers, we have

\begin{aligned} f(x + y) &= S(e^{x + y}) \\ &= S(e^x e^y) \\ &= S(e^x) + S(e^y) \\ &= f(x) + f(y). \end{aligned}

Setting $x = y = 0$ in this equation gives $f(0) = f(0) + f(0)$, and subtracting $f(0)$ from both sides gives $f(0) = 0$. Similarly, for every non-positive $x \in \mathbb R$ and every non-negative $n \in \mathbb Z$ (the set $\mathbb Z$ is the set of the integers) we have $f((n + 1) x) = f(nx + x) = f(nx) + f(x)$, and if we assume as an inductive hypothesis that $f(nx) = nf(x)$ it follows that $f((n + 1) x) = (n + 1) f(x)$. Together with the base case $f(0 \cdot x) = f(0) = 0$, this proves by induction that $f(nx) = nf(x)$ for every non-negative $n \in \mathbb Z$.

Now, let $c = f(-1)$. For every non-positive $x \in \mathbb Q$ (the set $\mathbb Q$ is the set of the rational numbers), there is a non-positive $m \in \mathbb Z$ and a positive $n \in \mathbb Z$ such that $x = m/n$, and therefore

\begin{aligned} nf(x) &= nf \left( \frac m n \right) \\ &= f(m) \\ &= f((-m)(-1)) \\ &= (-m)f(-1) \\ &= -mc, \end{aligned}

from which it follows that $f(x) = -mc/n = -cx$.

The equation $f(x) = -cx$ in fact holds for every non-positive $x \in \mathbb R$, not just for $x \in \mathbb Q$. To prove this, first, observe that $f$ is strictly decreasing, because for every pair $x, y$ of non-positive real numbers such that $x < y$, we have $e^x < e^y$ and therefore $f(y) = S(e^y) < S(e^x) = f(x)$ (the mapping $S$ being strictly decreasing by property 2). Second, suppose $x$ is a non-positive real number and $f(x) \ne -cx$. Then $-f(x)/c \ne x$, i.e. either $-f(x)/c < x$ (case 1) or $x < -f(x)/c$ (case 2). Because every non-empty interval contains a rational number, it follows that there is an $a \in \mathbb Q$ such that $-f(x)/c < a < x$ (in case 1) or $x < a < -f(x)/c$ (in case 2).

In both cases, we have $a \le 0$. This is obvious in case 1, because $a < x \le 0$. As for case 2, observe first that $S$ is non-negative-valued and therefore $-f(x) \le 0$. If $c$ is positive, it will then follow that $-f(x)/c \le 0$ and therefore $a \le 0$ (because $a < -f(x)/c$). To prove that $c$ is positive, observe that $f(-1) = c$, i.e. $S(1/e) = c$, and $f(0) = 0$, i.e. $S(1) = 0$; because $S$ is strictly decreasing and $1/e < 1$, it follows that $c = S(1/e) > S(1) = 0$.

Because $a \le 0$ in both cases, we have $f(a) = -ca$. And by the strict decreasingness of $f$ it follows that either $f(x) < -ca$ (in case 1) or $-ca < f(x)$ (in case 2). Rearranging these two inequalities gives us $a < -f(x)/c$ (in case 1) or $-f(x)/c < a$ (in case 2). But we already have $-f(x)/c < a$ (in case 1) or $a < -f(x)/c$ (in case 2), so in both cases there is a contradiction. It follows that $f(x) \ne -cx$ cannot hold; we must have $f(x) = -cx$.

Finishing off, note that for every $p \in [0, 1]$, we have $S(p) = S(e^{\log p}) = f(\log p) = -c \log p$. Let $b = e^{1/c}$; then $b > 1$ because $c > 0$, and $\log b = 1/c$, i.e. $c = 1/(\log b)$, from which it follows that $S(p) = -(\log p)/(\log b) = -\log_b p$ for every $p \in [0, 1]$. From this we can conclude that for every mapping $S$ on $[0, 1]$ such that properties 1–3 hold, we have

$\displaystyle S = p : [0, 1] \mapsto -\log_b p,$

where $b$ is a real number greater than 1. $\blacksquare$

In the rest of this post, suppose $b$ is a real number greater than 1. The surprisingness mapping $S$ will be defined by

$\displaystyle S = p : [0, 1] \mapsto -\log_b p.$

It doesn’t really make any essential difference to the nature of $S$ which value of $b$ is chosen, because for every $p \in [0, 1]$, we have $S(p) = -\log_b p = -(\log p)/(\log b)$ (where the $\log$ symbol with no base specified refers to the natural logarithm, the one to base $e$), and therefore the choice of base amounts to a choice of vertical scaling. Below are three plots of surprisingness against probability for three different bases (2, $e$ and 10); you can see that the shape of the graph is the same for each base, but the vertical scaling is different.

Note that regardless of the choice of base, we have

\begin{aligned} S(1) &= -\log_b 1 = -0 = 0, \\ S(0) &= -\log_b 0 = -(-\infty) = \infty. \end{aligned}

That is, events thought sure to occur are entirely unsurprising, and events thought sure not to occur are infinitely surprising when they do happen.

Also,

$\displaystyle S \left( \frac 1 b \right) = -\log_b \left( \frac 1 b \right) = -(-1) = 1.$

I’m inclined to say on this basis that 2 is the most sensible choice of base, because of all the possible probabilities other than 0 and 1, the one which is most special and thus most deserving of corresponding to the unit surprisingness is probably $1/2$, the probability exactly halfway between 0 and 1. However, base $e$ is also quite convenient because it simplifies some of the formulae we’ll see later on in this post.
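To make this concrete, here is a minimal Python sketch of the surprisingness mapping (the function name `surprisingness` is my own choice, not standard terminology):

```python
import math

def surprisingness(p, b=2):
    """S(p) = -log_b(p), with S(0) = +infinity and S(1) = 0."""
    if p == 0:
        return math.inf
    return -math.log(p, b)

# With base 2, a probability-1/2 event (one head) has unit surprisingness,
# and ten heads in ten flips is ten times as surprising:
print(surprisingness(1 / 2))      # 1.0
print(surprisingness(1 / 1024))   # 10.0 (up to floating-point rounding)

# Property 3: for independent events, surprisingnesses add.
s_both = surprisingness(0.5 * 0.25)
print(abs(s_both - (surprisingness(0.5) + surprisingness(0.25))) < 1e-12)  # True
```

Swapping the base `b` only rescales all the outputs, matching the observation above about vertical scaling.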

### Entropy

First, two basic definitions from probability theory (for readers who may not be familiar with them). Suppose $A$ is a finite set.

Definition 1. The probability distributions on $A$ are the positive real-valued mappings $f$ on $A$ such that

$\displaystyle (1) \quad \sum_{x \in A} f(x) = 1.$

The probability distributions on $A$ can be thought of as random generators of members of $A$: each member $x$ of $A$ is generated with probability $f(x)$. The probabilities of generation have to add up to 1, and this is expressed by equation (1). Note that the requirement that the probabilities add up to 1 implies that each individual probability is less than or equal to 1, so there is no need to state the latter requirement explicitly in the definition.

Henceforth, suppose $f$ is a probability distribution on $A$.

Definition 2. For every real-valued mapping $\tau$ on $A$, the expected value or mean $\mathrm E_f(\tau)$ of the transformation of $f$ by $\tau$ is given by the formula

$\displaystyle (2) \quad \mathrm E_f(\tau) = \sum_{x \in A} f(x) \tau(x).$

Let $x_1$, $x_2$, … and $x_n$ be the members of $A$ and let $p_1 = f(x_1)$, $p_2 = f(x_2)$, … and $p_n = f(x_n)$. Then for every real-valued mapping $\tau$ on $A$, the expected value $\mathrm E_f(\tau)$ is the weighted average of $\tau(x_1)$, $\tau(x_2)$, … and $\tau(x_n)$, with the weights the respective probabilities $p_1$, $p_2$, … and $p_n$.

If a very large number $N$ of values are generated (independently) by $f$, then the average value under $\tau$ of these values can be calculated as

$\displaystyle \frac 1 N \sum_{k = 1}^n f_k \tau(x_k) = \sum_{k = 1}^n \frac {f_k} N \tau(x_k),$

where $f_1$, $f_2$, … and $f_n$ are the frequencies of the values $x_1$, $x_2$, … and $x_n$ in the sample. Given that $N$ is very large it is very likely that the relative frequencies $f_1/N$, $f_2/N$, … and $f_n/N$ will be close to $p_1$, $p_2$, … and $p_n$, respectively, and therefore the average will be close to $\mathrm E_f(\tau)$.
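As a quick illustration (the three-valued distribution and the mapping $\tau$ below are made up for the example), equation (2) and the large-sample behaviour can be checked in Python:

```python
import random

# A made-up three-valued distribution and a real-valued mapping tau.
values = ["a", "b", "c"]
probs = [0.5, 0.3, 0.2]
tau = {"a": 1.0, "b": 4.0, "c": 10.0}

# Equation (2): E_f(tau) is the probability-weighted average of tau's values.
expected = sum(p * tau[x] for x, p in zip(values, probs))
print(expected)  # 0.5*1 + 0.3*4 + 0.2*10 = 3.7

# A large sample's average should come out close to E_f(tau).
random.seed(0)
sample = random.choices(values, weights=probs, k=100_000)
empirical = sum(tau[x] for x in sample) / len(sample)
print(abs(empirical - expected) < 0.1)  # True with very high probability
```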

Now, the definition of the key concept of this post:

Definition 3. The expected surprisingness or entropy $\mathrm H_f$ (that’s the Greek capital letter eta, not the Latin capital letter H) of $f$ is $\mathrm E_f(S \circ f)$ (here, $S \circ f$ is the composition of $S$ and $f$, i.e. $x : A \mapsto S(f(x))$).

The use of the word “entropy” here is closely related to the use of the same word in thermodynamics. However, I won’t go into the connection here (I don’t really know enough about thermodynamics to be able to talk intelligently about it).

Having the concept of entropy to hand gives us a fun new question we can ask about any given probability distribution: what is its entropy, or, more evocatively, how surprising is it?

By (2), we have

$\displaystyle (3) \quad \mathrm H_f = \sum_{x \in A} f(x) S(f(x)).$

Using the definition of $S$, this can also be written as

$\displaystyle (4) \quad \mathrm H_f = -\sum_{x \in A} f(x) \log_b f(x).$

Using logarithm properties, it can even be written as

$\displaystyle \quad \mathrm H_f = \log_b \frac 1 {\prod_{x \in A} f(x)^{f(x)}}.$

This shows that $\mathrm H_f$ is the logarithm of the reciprocal of

$(5) \quad \displaystyle \prod_{x \in A} f(x)^{f(x)},$

which is the geometric weighted average of $p_1$, $p_2$, … and $p_n$, with the weights being identical to the values averaged. A geometric weighted average is similar to an ordinary weighted average, except that the values averaged are multiplied rather than added and the weights are exponents rather than factors. The product (5) can therefore be thought of as the “expected probability” of the value generated by $f$. To the extent that $f$ is likely to generate one of the members of $A$ which has a low individual probability of being generated, the product (5) is small and accordingly $\mathrm H_f$ is large. This may help give you a slightly more concrete idea why $\mathrm H_f$ can be thought of as the expected surprisingness of $f$.

Note that the value of $\mathrm H_f$ is determined completely by the probabilities $p_1$, $p_2$, … and $p_n$. The values $x_1$, $x_2$, … and $x_n$ are irrelevant in and of themselves. This is evident from the formula (3), in which the expression $x$ only appears as a sub-expression of $f(x)$. To understand how $\mathrm H_f$ is determined by these probabilities, it helps to look at the graph of the mapping $p : (0, 1] \mapsto p S(p)$, which I have plotted below. The entropy $\mathrm H_f$ is a sum of exactly $n$ values of this mapping.

It can be seen from the graph that the mapping has a maximum value. Using calculus, we can figure out what this maximum value is and exactly where it is attained.

Theorem 2. The maximum value attained by $p : (0, 1] \mapsto p S(p)$ is $1/(e \log b)$, where $\log b$ is the natural logarithm of $b$, and this maximum is attained at the point $1/e$ (and nowhere else).

Proof. The derivative of $p : (0, +\infty) \mapsto p (-\log_b p)$ is $p : (0, +\infty) \mapsto -(1 + \log p)/\log b$ (by the product rule, using $\log_b p = (\log p)/(\log b)$), which is positive if $p < 1/e$ (because then $\log p < -1$), equal to 0 if $p = 1/e$ (because then $\log p = -1$), and negative if $p > 1/e$ (because then $\log p > -1$). Therefore $p : (0, +\infty) \mapsto p (-\log_b p)$ increases in value from the vicinity of 0 to $1/e$ and decreases in value from $1/e$ towards $+\infty$, which means it attains a maximum at $1/e$, and nowhere else. The maximum value is $(1/e) S(1/e) = (1/e) \log_b e = 1/(e \log b)$. $\blacksquare$
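A quick numerical sanity check of this, using base 2 and a simple grid search in Python:

```python
import math

def p_times_s(p, b=2):
    """The mapping p -> p * S(p) = -p * log_b(p)."""
    return -p * math.log(p, b)

# Grid search over (0, 1]: the maximum should sit at p = 1/e,
# with value 1/(e * ln b).
grid = [k / 100_000 for k in range(1, 100_001)]
best = max(grid, key=p_times_s)
print(abs(best - 1 / math.e) < 1e-4)  # True
print(abs(p_times_s(best) - 1 / (math.e * math.log(2))) < 1e-6)  # True
```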

Because $\mathrm H_f$ is a sum of $n$ values of $p : (0, 1] \mapsto p S(p)$, we may conclude that the inequality

$\displaystyle \mathrm H_f \le \frac n {e \log b}$

always holds. However, this upper bound on the value of $\mathrm H_f$ can be improved, as we’ll see below. After all, in proving that it holds we’ve made no use of equation (1).

### Cross entropy

The concept of cross entropy is a useful generalization of the concept of entropy.

Suppose $g$ is another probability distribution on $A$. Think of $f$ and $g$ as two different models of the random member-of-$A$ generator: $f$ is the right model (or a more right model, if you don’t like the idea of one single right model), and $g$ is an observer’s working model. The probabilities $p_1$, $p_2$, … and $p_n$ can be thought of as the real probabilities of generation of $x_1$, $x_2$, … and $x_n$, respectively, while the probabilities $q_1 = g(x_1)$, $q_2 = g(x_2)$, … and $q_n = g(x_n)$ can be thought of as the observer’s subjective probabilities. Although the real probabilities determine what the observer observes, the subjective probabilities are what determine how surprised the observer is by what they observe. Therefore, if the observer calculates entropy by averaging their surprisingness over a large number of observations, they will get something close to

$\displaystyle \sum_{k = 1}^n p_k S(q_k),$

i.e. $\mathrm E_f(S \circ g)$. This quantity is called the cross entropy from $g$ to $f$ and denoted $\mathrm H(f, g)$. Note that if $f = g$ then $\mathrm H(f, g) = \mathrm H_f$ (from here on I’ll write $\mathrm H(f)$ for $\mathrm H_f$, to match the cross-entropy notation); that’s why the concept of cross entropy is a generalization of the concept of entropy. Note also that $\mathrm H(f, g)$ is not necessarily the same as $\mathrm H(g, f)$.

Your intuition should tell you that $\mathrm H(f, g)$ will always be greater than $\mathrm H(f)$, if $f \ne g$. Why? Think about it: $\mathrm H(f, g)$ is the expected surprisingness if the observer has the wrong model, $\mathrm H(f, f)$ is the expected surprisingness if the observer has the right model. Having the right model should lead to the observer being better able to predict which member of $A$ is generated and thus to the observer being less likely to be surprised.

It is indeed the case that

$\displaystyle (6) \quad \mathrm H(f, g) > \mathrm H(f)$

if $f \ne g$ (which further reassures us that $S$ is a very good mathematical model of the intuitive notion of surprisingness). The inequality (6) is called Gibbs’ inequality. In order to prove it, first, observe that it may be rewritten as

$\displaystyle (7) \quad \mathrm H(f, g) - \mathrm H(f) > 0.$

Now, the quantity on the left-hand side of (7) has a name of its own: it’s called the Kullback-Leibler divergence from $g$ to $f$ and denoted $\mathrm D(f, g)$. It measures the “penalty” in increased surprisingness the observer gets for having the wrong model; it can also be thought of as a measure of how different the probability distributions $f$ and $g$ are from each other, hence the name “divergence”. As for the “Kullback-Leibler” part, that’s just because mathematicians have come up with lots of different ways of measuring how different two probability distributions are from each other and Kullback and Leibler were the mathematicians who came up with this particular measure. I won’t be referring to any other such measures in this post, however, so whenever I need to refer to Kullback-Leibler divergence again I’ll just refer to it as “divergence”.

So Gibbs’ inequality, reformulated, states that the divergence between two unequal probability distributions is always positive. To prove this, it’s helpful to first write out an explicit expression for $\mathrm D(f, g)$:

\displaystyle \begin{aligned} \mathrm D(f, g) &= \mathrm H(f, g) - \mathrm H(f) \\ &= \sum_{x \in A} f(x) S(g(x)) - \sum_{x \in A} f(x) S(f(x)) \\ &= \sum_{x \in A} f(x) (S(g(x)) - S(f(x))) \\ &= \sum_{x \in A} f(x) (\log_b f(x) - \log_b g(x)) \\ &= \sum_{x \in A} f(x) \log_b \frac {f(x)} {g(x)}. \end{aligned}

Second, we prove a lemma.

Lemma 1. For every positive $x \in \mathbb R$, we have $\log_b x \ge (x - 1)/(x \log b)$, where $\log b$ is the natural logarithm (logarithm to base $e$) of $b$, with equality if and only if $x = 1$.

Proof. The inequality in question is equivalent to $\log x \ge (x - 1)/x$ (by multiplication of both sides by $\log b$). The right-hand side of this inequality is equal to $1 - 1/x$. Consider the mapping $x : (0, +\infty) \mapsto \log x - (1 - 1/x)$. The derivative of this mapping is $x : (0, +\infty) \mapsto 1/x - 1/x^2$, which is negative if $x < 1$ (because then $x^2 < x$ and therefore $1/x < 1/x^2$), equal to 0 if $x = 1$ (because then $x^2 = x$) and positive if $x > 1$ (because then $x^2 > x$ and therefore $1/x > 1/x^2$). Therefore $x : (0, +\infty) \mapsto \log x - (1 - 1/x)$ is strictly decreasing on $(0, 1)$ and strictly increasing on $(1, +\infty)$. It follows that its minimum value is attained at the point $1$, and nowhere else. That minimum value is $\log 1 - (1 - 1/1) = 0 - 0 = 0$. $\blacksquare$
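A quick spot-check of the lemma in Python, with $b = 2$:

```python
import math

# Lemma 1 with b = 2: log_b(x) >= (x - 1)/(x * ln b), equality only at x = 1.
for x in [0.1, 0.5, 1.0, 2.0, 10.0]:
    lhs = math.log(x, 2)
    rhs = (x - 1) / (x * math.log(2))
    assert lhs >= rhs - 1e-12
    if x != 1.0:
        assert lhs > rhs  # strict inequality away from x = 1
print("Lemma 1 holds on the spot-checks")
```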

Using Lemma 1, we have

$\displaystyle (8) \quad \log_b \frac {f(x)} {g(x)} \ge \frac {f(x)/g(x) - 1} {(f(x)/g(x)) \log b} = \frac {f(x) - g(x)} {f(x) \log b}$

for every $x \in A$, with equality if and only if $f(x)/g(x) = 1$, i.e. $f(x) = g(x)$. Given that $f \ne g$, there is at least one $x \in A$ such that $f(x) \ne g(x)$ and therefore (8) holds without equality. It follows that

\begin{aligned} \mathrm D(f, g) &> \sum_{x \in A} f(x) \frac {f(x) - g(x)} {f(x) \log b} \\ &= \frac 1 {\log b} \sum_{x \in A} (f(x) - g(x)) \\ &= \frac 1 {\log b} \left( \sum_{x \in A} f(x) - \sum_{x \in A} g(x) \right) \\ &= \frac {1 - 1} {\log b} \\ &= 0, \end{aligned}

which proves Gibbs’ inequality.
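Gibbs’ inequality and the asymmetry of the divergence are easy to check numerically; here is a Python sketch with made-up distributions:

```python
import math

def kl_divergence(f, g, b=2):
    """D(f, g) = sum of f(x) * log_b(f(x) / g(x)) over x with f(x) > 0."""
    return sum(p * math.log(p / q, b) for p, q in zip(f, g) if p > 0)

f = [0.5, 0.3, 0.2]        # the "right" model
g = [1 / 3, 1 / 3, 1 / 3]  # the observer's working model

print(kl_divergence(f, f))          # 0.0 -- no penalty for the right model
print(kl_divergence(f, g) > 0)      # True -- Gibbs' inequality
print(kl_divergence(f, g) == kl_divergence(g, f))  # False -- not symmetric
```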

Gibbs’ inequality is quite powerful and useful. For example, it can be used to figure out what the maximum possible entropy is. Suppose $g$ is such that $\mathrm H(f, g) = \mathrm H(g)$, regardless of the value of $f$. Then if $f \ne g$, we have $\mathrm H(g) > \mathrm H(f)$ by Gibbs’ inequality, and therefore $\mathrm H(g)$ is the maximum entropy possible for probability distributions on $A$ and $g$ is the unique probability distribution on $A$ whose entropy is $\mathrm H(g)$. Is there a probability distribution $g$ on $A$ such that $\mathrm H(f, g) = \mathrm H(g)$, regardless of the value of $f$? There is indeed. Let $g$ be the uniform distribution on $A$, i.e. the mapping $x : A \mapsto 1/n$ (remember, $n$ is the number of members $A$ has). Then

$\displaystyle \sum_{x \in A} g(x) = \frac n n = 1$

so $g$ is a probability distribution, and regardless of the value of $f$ we have

\begin{aligned} \mathrm H(f, g) &= \sum_{x \in A} f(x) \left(-\log_b \frac 1 n \right) \\ &= \left( \sum_{x \in A} f(x) \right) \log_b n \\ &= \log_b n. \end{aligned}

Therefore, the maximum entropy possible on $A$ is $\log_b n$, and the uniform distribution on $A$ is the probability distribution which attains this maximum.
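This maximum-entropy property can be checked numerically; a Python sketch using randomly generated distributions on a 5-element set:

```python
import math
import random

def entropy(probs, b=2):
    """H_f = -sum of p * log_b(p), skipping zero-probability terms."""
    return -sum(p * math.log(p, b) for p in probs if p > 0)

n = 5
uniform = [1 / n] * n
# The uniform distribution attains log_b(n):
print(abs(entropy(uniform) - math.log2(n)) < 1e-12)  # True

# Randomly generated distributions on the same set never exceed it:
random.seed(1)
for _ in range(1000):
    raw = [random.random() for _ in range(n)]
    total = sum(raw)
    f = [x / total for x in raw]
    assert entropy(f) <= math.log2(n) + 1e-12
print("no distribution beat the uniform one")
```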

A consequence of this is that the divergence from the uniform distribution $g$ on $A$ to $f$ is given by

$\displaystyle \mathrm D(f, g) = \mathrm H(f, g) - \mathrm H(f) = \log_b n - \mathrm H(f),$

which is just the negation of $\mathrm H(f)$ plus a constant $\log_b n$ (well, a constant depending on the size of the set $A$ which $f$ is distributed over). Therefore, among probability distributions on $A$ specifically, entropy can be thought of as an inverse measure of divergence from the uniform distribution: the higher the entropy, the closer the distribution is to uniform. Among probability distributions in general, entropy reflects both closeness to the uniform distribution and the size of the distributed-over set.
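The identity $\mathrm D(f, g) = \log_b n - \mathrm H(f)$ for uniform $g$ can also be checked directly (the distribution $f$ below is made up for the example):

```python
import math

def entropy(probs, b=2):
    # Equation (4): H_f = -sum of p * log_b(p).
    return -sum(p * math.log(p, b) for p in probs if p > 0)

def kl_divergence(f, g, b=2):
    # D(f, g) = sum of f(x) * log_b(f(x) / g(x)).
    return sum(p * math.log(p / q, b) for p, q in zip(f, g) if p > 0)

f = [0.7, 0.2, 0.1]
n = len(f)
uniform = [1 / n] * n

lhs = kl_divergence(f, uniform)
rhs = math.log2(n) - entropy(f)
print(abs(lhs - rhs) < 1e-12)  # True
```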

### Philosophical implications

So far, we’ve seen that the informal concept of surprisingness can be formalized mathematically with quite a high degree of success—one might even say a surprising degree of success, which is fitting—and that’s pretty neat. But there are also some deeper philosophical issues related to all this. I’ve avoided talking about them up till now because philosophy is not really my field, but I couldn’t let them go completely unmentioned.

Suppose that you know that one of the $n$ values $x_1$, $x_2$, … and $x_n$ (where $n$ is a positive integer) will be generated by some probability distribution $f$, but you don’t know the probabilities; you have absolutely no information about the probability distribution other than the set of values it may generate. What should your subjective probability distribution be? A possible answer to this question is that you shouldn’t have one—you simply don’t know the probabilities, and that’s all that can be said. And that’s reasonable enough, especially if you don’t like the concept of subjective probability or think it’s incoherent (e.g. if you’re a strict frequentist when it comes to interpreting probability). But if you accept that subjective probabilities are things it makes sense to talk about, well, the whole point of subjective probabilities is that they represent states of knowledge under uncertainty, so there’s no reason in principle to avoid having a subjective probability distribution just because this particular state of knowledge is particularly uncertain. Of course this subjective probability distribution may change as you gather more information—think of it as your “best guess” to start with, to be improved later on.

There are some people who’d say that you can choose your initial “best guess” on an essentially arbitrary basis, according to your own personal whims. No matter which subjective probability distribution you choose to start with, it will get better and better at modelling reality as you update it with the information you gather; the initial state isn’t really important. The initial state is of interest only in that if we have some updating mechanism in mind, we’d like to be able to prove that convergence to the real probability distribution will always happen independently of the initial state.

There is another position which can be taken, however, which is that there is in fact a certain objectively best subjective probability distribution to start with. This position is associated with the Marquis de Laplace (1749–1827), who wrote a classic text on probability theory, the Théorie analytique des probabilités (Laplace was also a great contributor to many other fields of mathematics, as mathematicians of his era tended to be—specialization came later). In Laplace’s opinion, the correct distribution to start with was the uniform distribution. That is, given that we know nothing about the probabilities, assume they are all the same (after all, we have no reason to believe any one is larger than any other). This principle is called the principle of indifference.

The concept of probability as a whole can be based on the idea of the principle of indifference. The idea would be that on some level, any probability distribution is over a set of interchangeable values and therefore uniform. However, we are often interested only in whether one of a particular class of values is generated (not in which particular member of that class), and the probability of interest in that case is the sum of the probabilities of each of the values in that class, which, because the underlying distribution is uniform, can also be expressed as the ratio of the number of values in the class to the total number of values which may be generated. I don’t know to what extent this is how Laplace actually thought of probability; I don’t want to be too hasty in attributing views to an author of his era which might be out of context or just incorrect (after all, I can’t read French, so all my information about Laplace is second-hand).

It’s not hard to argue with the principle of indifference. It’s quite difficult to think of any reasonable justification for it at all. It was indeed attacked in vehement terms by later probability theorists, such as John Venn (1834–1923) (that’s right, the Venn diagram guy). In modern times, however, it has been championed by the statistical physicist E. T. Jaynes (1922–1998), who also came up with an interesting extension of it.

In the mathematical section of this post, we saw that the uniform distribution over $x_1$, $x_2$, … and $x_n$ was the one with the most entropy, i.e. the one which is “most surprising on average”. Therefore, a natural generalization of the principle of indifference would be to say that in any circumstance in which a subjective probability distribution must be assumed in a situation of uncertainty, one should assume whichever distribution has the most entropy while still being compatible with what is known. For example, if it is known what the mean is then the assumed distribution should be the one with the most entropy among all distributions having that specific mean. This is called the principle of maximum entropy or MaxEnt for short.
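As a toy illustration of MaxEnt (the three-point example below is my own), consider distributions on $\{0, 1, 2\}$ constrained to have mean 1. The constraints leave one free parameter, and a grid search finds the entropy-maximizing choice:

```python
import math

def entropy(probs):
    # Natural-log entropy; the base only rescales, so it doesn't affect the argmax.
    return -sum(p * math.log(p) for p in probs if p > 0)

# Distributions on {0, 1, 2} with mean 1: writing p2 = P(2), the
# constraints sum = 1 and mean = 1 force p0 = p2 and p1 = 1 - 2*p2.
best_p2, best_h = None, -1.0
for k in range(1, 500):
    p2 = k / 1000  # p2 ranges over (0, 0.5)
    f = [p2, 1 - 2 * p2, p2]
    h = entropy(f)
    if h > best_h:
        best_p2, best_h = p2, h

# Here MaxEnt recovers the uniform distribution, which already has mean 1:
print(abs(best_p2 - 1 / 3) < 1e-2)  # True
```

With a different constrained mean (say 0.5), the maximizer would no longer be uniform, which is where MaxEnt goes beyond the plain principle of indifference.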

The MaxEnt principle makes sense intuitively, sort of, in a way that the principle of indifference on its own doesn’t. If you don’t know something about the probability distribution, you should expect to be surprised more often by the values it generates than if you do know that something. It’s still on fairly shaky ground, though, and I don’t know how far it is accepted by people other than Jaynes as a normative principle, as opposed to just one way in which you might choose the subjective probability distribution to start with, in competition with other ways of doing so on the basis of strictly pragmatic considerations (which seems to be how a lot of people in the practical applications side of things view it). In any case it gives us a philosophical motivation for examining the mathematical problem of finding the probability distributions that have the most entropy given particular constraints. The mathematical problem is interesting itself, but the philosophical connection makes it more interesting.

In part 2 of this post, I’m going to describe how the famous normal distribution can be characterized as a maximum entropy distribution: namely, if it’s the normal distribution with mean $\mu$ (a real number) and standard deviation $\sigma$ (a positive real number), then it’s the probability distribution with the most entropy among all absolutely continuous probability distributions over $\mathbb R$ with mean $\mu$ and standard deviation $\sigma$. That roughly means that by MaxEnt, if you know nothing about a continuous probability distribution other than that it has a particular mean and a particular standard deviation, your best guess to start with is that it’s normal. You can understand the famous Central Limit Theorem from that perspective: as you add up independent, identically distributed random variables, the mean and variance of the distribution in question will be carried over into the sum (elementary statistical theory tells us that the mean of any sum of random variables is the sum of the individual means, and the variance of any sum of independent random variables is the sum of the individual variances), but every other distinctive property of the distribution is gradually kneaded out by the summing process, so that as the sum gets larger and larger all we can say about the probability distribution of the sum is that it has this particular mean and this particular variance. I intend to finish off part 2 with a proof of the Central Limit Theorem from this perspective, although that might be a little ambitious. Before that, though, the other big thing which I need to cover in part 2 is defining entropy in the case of a continuous probability distribution—I’ve only been talking about discrete probability distributions in part 1, and it turns out the extension is not entirely a straightforward matter.