# Category Archives: Mathematics

## Pure and mixed strategies in prediction

Consider the following very simple game: a Bernoulli trial (a trial which results in one of two possible outcomes, labelled “success” and “failure”) is carried out with success probability $p$. Beforehand, you are told the value of $p$ and asked to give a definite prediction of the trial’s outcome. That is, you have to predict either success or failure; just saying “the probability of success is $p$” is not enough. You win if and only if you predict the correct outcome.

Here are two reasonable-sounding strategies for this game:

1. If $p > 0.5$, predict success. If $p < 0.5$, predict failure. If $p = 0.5$, predict success with probability 0.5 and failure with probability 0.5.
2. Predict success with probability $p$ and failure with probability $1 - p$.

In game-theoretic language, the difference between strategies 1 and 2 is that strategy 1 involves the use of a pure strategy if possible, i.e. one in which the choice of what to predict is made deterministically, while strategy 2 is always mixed, i.e. the choice of what to predict is made randomly.

But which is better? Note that the answer may depend on the value of $p$. Try to think about it for a minute before moving on to the next paragraph.

If $p = 0.5$, then the strategies are identical and therefore both equally good.

If $p \ne 0.5$, let $q$ be the probability of the more probable outcome (i.e. $p$ if $p > 0.5$ and $1 - p$ if $p < 0.5$). If the more probable outcome happens, then you win for sure under strategy 1 but you only have probability $q$ of winning under strategy 2. If the less probable outcome happens, then you lose for sure under strategy 1 but you still have probability $1 - q$ of winning under strategy 2. Therefore the probability of winning is $q \cdot 1 + (1 - q) \cdot 0 = q$ under strategy 1 and $q \cdot q + (1 - q) \cdot (1 - q) = 1 - 2q(1 - q)$ under strategy 2. So strategy 1 is better than strategy 2 if and only if

$\displaystyle q > 1 - 2q(1 - q),$

i.e.

$\displaystyle 3q - 2q^2 - 1 > 0.$

This quadratic inequality holds if and only if $0.5 < q < 1$. But $q$ is the probability of the more probable outcome, and therefore $q > 0.5$ for sure. Therefore, strategy 1 is always better if $p \ne 0.5$.

I find this result weird and a little counterintuitive when it’s stated so abstractly. It seems to me like the most natural way of obtaining a definite value from the distribution—drawing randomly from the distribution—should be the best one.

But I guess it does makes sense, if you think about it as applying to a concrete situation. For example, if you were on a jury and you thought there was a $1/1024$ probability that the defendant was guilty, it would be crazy to then flip 10 coins and precommit to arguing for the defendant’s guilt if every one of them came up heads. The other jurors would think you were mad (and probably be very angry with you, if they did all come up heads).

The result has interesting implications for how people should act on their beliefs. If you believe that degrees of belief can be usefully modelled as probabilities, and you try to apply this in everyday reasoning, you will often be faced with the problem of deciding whether to act in accordance with a belief’s truth even if you only place a certain probability $p$ on that belief being true. Should you always act in accordance with the belief if $p > 0.5$, or should you have probability $p$ of acting in accordance with it at any given time? Until I wrote this post it wasn’t obvious to me, but the result in this post suggests you should do the former.

I do wonder if there is anything strategy 2 is good for, though. Comment if you have an idea!

## Modelling communication systems

One of the classes I’m taking this term is about modelling the evolution of communication systems. Everything in the class is done via simulation, which is probably the best way to do it, and certainly necessary at the point where it starts to involve genetic algorithms and such. However, some of the earlier content in the class dealt with problems that I suspected were solvable by a purely mathematical approach, so as somebody with a maths degree I felt it necessary to rise to the challenge and try to derive the solutions mathematically. This post is my attempt to do that.

Let us begin by thinking very abstractly about a system which takes something in and gives something out. Suppose there is a finite, positive number m of things which may be taken in (possible inputs), which we shall call input 1, input 2, … and input m. Suppose likewise that there is a finite, positive number n of things which may be given out (possible outputs), which we shall call output 1, output 2, … and output n.

One way in which the behavior of such a system could be modelled is as a straightforward mapping from inputs to outputs. However, this might be too deterministic: perhaps the system doesn’t always output the same output for a given input. So let’s use a more general model, and think of the system as a mapping from inputs to probability distributions over outputs. For every pair (i, j) of integers such that 0 ≤ im and 0 ≤ jn, let pi, j denote the probability that input i is mapped to output j. The mapping as a whole is determined by the mn probabilities of the form pi, j, and therefore it can be thought of as an m-by-n matrix A:

$\displaystyle \mathbf A = \left( \begin{matrix} p_{1, 1} & p_{1, 2} & \hdots & p_{1, n} \\ p_{2, 1} & p_{2, 2} & \hdots & p_{2, n} \\ \vdots & \vdots & \ddots & \vdots \\ p_{m, 1} & p_{m, 2} & \hdots & p_{m, n} \end{matrix} \right).$

The rows of A correspond to the possible inputs and the columns of A correspond to the possible outputs. Probabilities are non-negative real numbers, so A is a non-negative real matrix. Also, the probabilities of mutually exclusive, exhaustive outcomes sum to 1, so the sum of each row of A is 1. This condition can be expressed as a system of linear equations:

\displaystyle \begin{aligned} p_{1, 1} &+ p_{1, 2} &+ \hdots &+ p_{1, n} &= 1 \\ p_{2, 1} &+ p_{2, 2} &+ \hdots &+ p_{2, n} &= 1 \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ p_{m, 1} &+ p_{m, 2} &+ \hdots &+ p_{m, n} &= 1. \end{aligned}

Alternatively, and more compactly, it may be expressed as the matrix equation

$\displaystyle (1) \quad \mathbf A \mathbf x = \mathbf y,$

where x is the n-dimensional vector whose components are all equal to 1 and y is the m-dimensional vector whose components are all equal to 1.

In general, if x is an n-dimensional vector, and we think of x as a random variable determined by the output of the system, then Ax is the vector of expected values of x conditional on each input. That is, for every integer i such that 1 ≤ im, the ith component of Ax is the expected value of x conditional on meaning i being the input to the system.

Accordingly, if we have not just one, but p n-dimensional vectors x1, x2, … and xp (where p is a positive integer), we can think of these p vectors as the columns of an n-by-p matrix B, and then we can read off all the expected values from the matrix product

$\displaystyle \mathbf A \mathbf B = \mathbf A \mathbf x_1 + \mathbf A \mathbf x_2 + \dotsb + \mathbf A \mathbf x_n$

like so: for every pair (i, k) of integers such that 0 ≤ im and 0 ≤ kp, the (i, k) entry of AB is the expected value of xk conditional on meaning i being the input to the system.

In the case where B happens to be another non-negative real matrix such that

$\displaystyle \mathbf B \mathbf x = \mathbf y,$

so that the entries of B can be interpreted as probabilities, the matrix B as a whole can be interpreted as another input-output system whose possible inputs happen to be the same as the possible outputs of A. In order to emphasize this identity, let us now call the possible outputs of A (= the possible inputs of B) the signals: signal 1, signal 2, … and signal n. The other things—the possible inputs of A, and the possible outputs of B—can be thought of as meanings. Note that there is no need at the moment for the input meanings (the possible inputs of A) to be the same as the output meanings (the possible outputs of B); we make a distinction between the input meanings and the output meanings.

Together, A and B can be thought of as comprising a “product system” which works like this: an input meaning goes into A, a signal comes out of A, the signal goes into B, and an output meaning comes out of B. For every integer k such that 0 ≤ kp, the random variable xk (the kth column of B) can now be interpreted as the probability of the product system outputting output meaning k, as a random variable whose value is determined by the signal. That is, for every integer j such that 0 ≤ jn, the jth component of xk (the (j, k) entry of B) is the probability of output meaning k coming out if the signal happens to be signal j. It follows by the law of total probability that the probability of output meaning k coming out, if i is the input meaning, is the expected value of xk conditional on i being the input meaning. Now, by what we said a couple of paragraphs above, we have that for every integer i such that 0 ≤ im, the expected value of xk conditional on i being the input meaning is the (i, k) entry of AB. So the “product system”, as a matrix, is the matrix product AB. That’s why we call it the “product system”, see? 🙂

In the case where the possible input meanings are the same as the possible output meanings and m = p, we may think about the “product system” as a communicative dyad. The speaker is A, the hearer is B. The speaker is trying to express a meaning, the input meaning, and producing a signal in order to do so, and the hearer is interpreting that signal to have some meaning, the output meaning. The output meaning the hearer understands is not necessarily the same as the input meaning the speaker was trying to express. If it is different, we may regard the communication as unsuccessful; if it is the same, we may regard the communication as successful.

The key question is: what is the probability that the communication is successful? Given the considerations above, it’s very easy to answer. If the input meaning is i, we’re just looking for the probability that output meaning i given this input meaning. That probability is simply the (i, i) entry of AB, i.e. the ith entry along AB‘s main diagonal.

What if the input meaning isn’t fixed? Then the answer will in general depend on the probability distribution over the possible input meanings. But in the simplest case, where the distribution is uniform (no input meaning is any more probable than any other), the probability of successful communication is just the mean of the input meaning-specific probabilities, that is, the sum of the main diagonal entries of AB, divided by m (the number of the main diagonal entries, i.e. the number of meanings). In linear algebra, we call the sum of the main diagonal entries of a square matrix its trace, and we denote it by tr(C) where C is the matrix. So our formula for the communication success probability p is

$\displaystyle (2) \quad p = \frac {\mathrm{tr}(\mathbf A \mathbf B)} m.$

If the probability distribution over the input meanings isn’t uniform, the probability of successful communication is just the weighted average of the input meaning-specific probabilities, with the weights being the respective input meaning probabilities. The general formula can therefore be written as

$(3) \quad \displaystyle p = \mathrm{tr}(\mathbf A \mathbf B \mathbf D) = \mathrm{tr}(\mathbf D \mathbf A \mathbf B)$

where D is the diagonal matrix of size m whose main diagonal is the probability distribution over the input meanings (i.e. for every integer i such that 0 ≤ im, the ith diagonal entry of D is the probability of input meaning i being the one the speaker tries to express). It doesn’t matter whether D is left-multiplied or right-multiplied, because the trace of the product is the same in either case. In the case where the probability distribution over the input meanings is uniform the diagonal entries of D are all equal to $1/m$, i.e $\mathbf D = \mathbf I_m/m$, where Im is the identity matrix of size m, and therefore (3) reduces to (2).

To leave you fully convinced that this formula works, here are some simulations. The 5 graphs below were generated using a Python script which you can view on GitHub. Each one involves 3 possible meanings, 3 possible signals, randomly-generated speaker and hearer matrices and a randomly-generated probability distribution over the input meanings. If you look at the code, you’ll see that the blue line is generated by simulating communication in the obvious way, by randomly drawing an input meaning, randomly drawing a signal based on that particular input meaning, and finally randomly drawing an output meaning based on that particular signal. The position on the x-axis corresponds to the number of trials (individual simulated communicative acts) carried out so far and the position on the y-axis corresponds to the proportion of those trials involving a successful communication (one where the output meaning ended up being the same as the input meaning). For each graph, there were 10 sets of 500 trials; each individual set of trials corresponds to one of the light blue lines, while the darker blue lines gives the results averaged over those ten sets. The horizontal green line indicates the success probability as calculated by our formula. This should be close to the success proportion for a large number of trials, so we should see the blue and green lines converging on the right side of each graph. That is what we see, so the formula works.

## The mathematics of surprise, part 1

I’ve split this post into two parts because it would be really long otherwise. Part 2 will be coming up later, hopefully.

### Surprisingness

Let’s think about how we might define surprisingness as a mathematical quantity.

Surprisingness is a property of events as perceived by observers. After the event occurs, the observer is surprised to a certain degree. The degree of surprisingness depends on how likely the observer thought it was that the event would occur, or to put it briefly the subjective probability of the event. In fact, to a reasonable approximation, at least, the degree of surprisingness depends only on the subjective probability of the event. The greater the subjective probability, the less the surprise.

For example, if you flip a coin and it comes up heads, this is not very surprising—the probability of this event was only $1/2$, which is reasonably high. But if you flip ten coins and they all come up heads, you are much more surprised, due to the fact that the probability of ten heads in ten flips is only

$\displaystyle \left( \frac 1 2 \right)^{10} = \frac 1 {1024}.$

This is a much smaller quantity than $1/2$, hence the large increase in surprise. Even if you are unable to do the calculation and work out that the probability of ten heads in ten flips is exactly $1/1024$, you are probably aware on an intuitive level that it is a lot smaller than $1/2$.

These considerations suggest surprisingness might be definable as a mapping $S$ on $[0, 1]$ (the set of the real numbers between 0 and 1, inclusive), such that for every $p \in [0, 1]$ (i.e. every member $p$ of $[0, 1]$), the value $S(p)$ is the common surprisingness of the events of subjective probability $p$. Listed below are three properties such a mapping $S$ should have, if it is to be in accordance with the everyday meaning of “surprise”.

Property 1. For every $p \in [0, 1]$ (i.e. every member $p$ of $[0, 1]$), the value $S(p)$ should be a non-negative real number, or $+\infty$ (positive infinity).

Justification. Surprisingness seems to be reasonably well-modelled as (to use statistics jargon) an interval variable: two surprisingness values can be compared to see which is less and which is greater, there is a sensical notion of “distance” between surprisingness values, etc. Therefore, it makes sense for surprisingness to be represented by a real number. Moreover, there is a natural minimum surprisingness, namely the surprisingness of an event of subjective probability 1 (i.e. which the observer was sure would happen)—such an event is not at all surprising, and it makes sense to speak of being, e.g., “twice as surprised” by one event compared to another with reference to this natural minimum. To use statistics jargon again, surprisingness is a ratio variable. Naturally, it makes sense to set the minimum value at 0.

Property 2. The mapping $S$ should be strictly decreasing. That is, for every pair $p, q$ of members of $[0, 1]$ such that $p < q$, we should have $S(q) < S(p)$.

Justification. As said above, events of high subjective probility are less surprising than those of low subjective probability.

Property 3. For every pair $p, q$ of members of $[0, 1]$, we should have $S(pq) = S(p) + S(q).$

Justification. Suppose $A$ and $B$ are independent events of subjective probabilities $p$ and $q$ respectively. Then, assuming subjective probabilities are assigned in a consistent way, obeying the laws of probability, the subjective probability of $A \cap B$ (the event that both $A$ and $B$ occur) is $pq$ and therefore the surprisingness of $A \cap B$ is $S(pq)$. But given that $A$ and $B$ are independent (so the observation of one does not change the observer’s subjective probability of the other), the surprisingness of $A \cap B$ overall, i.e. $S(pq)$, should be the same as the total surprisingness of the individual observations of $A$ and $B$, i.e. $S(p) + S(q)$.

Remarkably, these three properties determine the form of $S$ almost exactly (up to a vertical scale factor). Feel free to skip the proof below; it is somewhat long and tedious and doesn’t use any ideas that will be used later on in the post, and I haven’t put in much effort to make it easy to follow.

Theorem 1. The mappings $S$ on $[0, 1]$ having properties 1–3 above are those of the form $p : [0, 1] \mapsto -\log_b p$ (i.e. those mappings $S$ on $[0, 1]$ such that $S(p) = - \log_b p$ for every $p \in [0, 1]$), where $b$ is a real number greater than 1.

Proof. Suppose $S$ is a mapping on $[0, 1]$ having properties 1–3 above. Let $f$ be the composition of $S$ and $x : \mathbb R \mapsto e^x$. Then the domain of $f$ is the set of the $x \in \mathbb R$ such that $0 \le e^x \le 1$, i.e. $x \le 0$; and for every pair $x, y$ of non-positive real numbers, we have

\begin{aligned} f(x, y) &= S(e^{x + y}) \\ &= S(e^x e^y) \\ &= S(e^x) + S(e^y) \\ &= f(x) + f(y). \end{aligned}

In the case where $x = y = 0$, we see that $f(0 + 0) = f(0) + f(0)$, and we also have $f(0 + 0) = f(0)$ because $0 + 0 = 0$, so combining the two we have $f(0) = f(0) + f(0)$, from which it follows by subtraction of $f(0)$ from both sides that $0 = f(0)$. Similarly, for every non-positive $x \in \mathbb R$ and every non-negative $n \in \mathbb Z$ (the set $\mathbb Z$ is the set of the integers) we have $f((n + 1) x) = f(nx + x) = f(nx) + f(x)$, and if we assume as an inductive hypothesis that $f(nx) = nf(x)$ it follows that $f((n + 1) x) = (n + 1) f(x)$. This proves by induction that $f(nx) = nf(x)$.

Now, let $c = f(-1)$. For every non-positive $x \in \mathbb Q$ (the set $\mathbb Q$ is the set of the rational numbers), there is a non-positive $m \in \mathbb Z$ and a positive $n \in \mathbb Z$ such that $x = m/n$, and therefore

\begin{aligned} nf(x) &= nf \left( \frac m n \right) \\ &= f(m) \\ &= f((-m)(-1)) \\ &= (-m)f(-1) \\ &= -mc, \end{aligned}

from which it follows that $f(x) = -mc/n = -cx$.

The equation $f(x) = -cx$ in fact holds for every non-positive $x \in \mathbb R$, not just for $x \in \mathbb Q$. To prove this, first, observe that $f$ is strictly decreasing, because for every pair $x, y$ of non-positive real numbers such that $x < y$, we have $e^x < e^y$ and therefore $f(y) = S(e^y) < S(e^x) = f(x)$ (the mapping $S$ being strictly decreasing by property 2). Second, suppose $x$ is a non-positive real number and $f(x) \ne -cx$. Then $-f(x)/c \ne x$, i.e. either $-f(x)/c < x$ (case 1) or $x < -f(x)/c$ (case 2). Because every non-empty interval contains a rational number, it follows that there is an $a \in \mathbb Q$ such that $-f(x)/c < a < x$ (in case 1) or $x < a < -f(x)/c$ (in case 2).

In both cases, we have $a \le 0$. This is obvious in case 1, because $x \le 0$. As for case 2, observe first that $S$ is non-negative-valued and therefore $-f(x) \le 0$. If $c$ is positive, it will then follow that $-f(x)/c \le 0$ and therefore $a \le 0$ (because $a < -f(x)/c$). To prove that $c$ is positive, observe that $S$ is strictly decreasing, and we have $f(1) = c$, i.e. $S(1/e) = c$, and $f(0) = 0$, i.e. $S(1) = 0$.

Because $a \le 0$ in both cases, we have $f(a) = -ca$. And by the strict decreasingness of $f$ it follows that either $f(x) < -ca$ (in case 1) or $-ca < f(x)$ (in case 2). Rearranging these two inequalities gives us $a < -f(x)/c$ (in case 1) or $-f(x)/c < a$ (in case 2). But we already have $-f(x)/c < a$ (in case 1) or $a < -f(x)/c$ (in case 2), so in both cases there is a contradiction. It follows that $f(x) \ne -cx$ cannot hold; we must have $f(x) = -cx$.

Finishing off, note that for every $p \in [0, 1]$, we have $S(p) = S(e^{\log p}) = f(\log p) = -c \log p$. Let $b = e^{1/c}$; then $b > 1$ because $c > 0$, and $\log b = 1/c$, i.e. $c = 1/(\log b)$, from which it follows that $S(p) = -(\log p)/(\log b) = -\log_b p$ for every $p \in [0, 1]$. From this we can conclude that for every mapping $S$ on $[0, 1]$ such that properties 1–3 hold, we have

$\displaystyle S = p : [0, 1] \mapsto -\log_b p,$

where $b$ is a real number greater than 1. $\blacksquare$

In the rest of this post, suppose $b$ is a real number greater than 1. The surprisingness mapping $S$ will be defined by

$\displaystyle S = p : [0, 1] \mapsto -\log_b p.$

It doesn’t really make any essential difference to the nature of $S$ which value of $b$ is chosen, because for every $p \in [0, 1]$, we have $S(p) = -\log_b p = -(\log p)/(\log b)$ (where the $\log$ symbol with no base specified refers to the natural logarithm, the one to base $e$), and therefore the choice of base amounts to a choice of vertical scaling. Below are three plots of surprisingness against probability for three different bases (2, $e$ and 10); you can see that the shape of the graph is the same for each base, but the vertical scaling is different.

Note that regardless of the choice of base, we have

\begin{aligned} S(1) &= -\log_b 1 = -0 = 0, \\ S(0) &= -\log_b 0 = -(-\infty) = \infty. \end{aligned}

That is, events thought sure to occur are entirely unsurprising, and events thought sure not to occur are infinitely surprising when they do happen.

Also,

$\displaystyle S \left( \frac 1 b \right) = -\log_b \left( \frac 1 b \right) = -(-1) = 1.$

I’m inclined to say on this basis that 2 is the most sensible choice of base, because of all the possible probabilities other than 0 and 1, the one which is most special and thus most deserving of corresponding to the unit surprisingness is probably $1/2$, the probability exactly halfway between 0 and 1. However, base $e$ is also quite convenient because it simplifies some of the formulae we’ll see later on in this post.

### Entropy

First, two basic definitions from probability theory (for readers who may not be familiar with them). Suppose $A$ is a finite set.

Definition 1. The probability distributions on $A$ are the positive real-valued mappings $f$ on $A$ such that

$\displaystyle (1) \quad \sum_{x \in A} f(x) = 1.$

The probability distributions on $A$ can be thought of as random generators of members of $A$. Each member of $A$ is generated with probability equal to the positive real number less than or equal to 1 associated with that member. The probabilities of generation for each member have to add up to 1, and this is expressed by equation (1). Note that the requirement that the probabilities add up to 1 implies the requirement that the probabilities are each less than or equal to 1, so there is no need to explicitly state the latter requirement in the definition.

Henceforth, suppose $f$ is a probability distribution on $A$.

Definition 2. For every real-valued mapping $\tau$ on $A$, the expected value or mean $\mathrm E_f(\tau)$ of the transformation of $f$ by $\tau$ is given by the formula

$\displaystyle (2) \quad \mathrm E_f(\tau) = \sum_{x \in A} f(x) \tau(x).$

Let $x_1$, $x_2$, … and $x_n$ be the members of $A$ and let $p_1 = f(x_1)$, $p_2 = f(x_2)$, … and $p_n = f(x_n)$. Then for every real-valued mapping $\tau$ on $A$, the expected value $\mathrm E_f(\tau)$ is the weighted average of $\tau(x_1)$, $\tau(x_2)$, … and $\tau(x_n)$, with the weights the respective probabilities $p_1$, $p_2$, … and $p_n$.

If a very large number $N$ of values are generated (independently) by $f$, then the average value under $\tau$ of these values can be calculated as

$\displaystyle \frac 1 N \sum_{k = 1}^n f_k \tau(x_k) = \sum_{k = 1}^n \frac {f_k} N \tau(x_k),$

where $f_1$, $f_2$, … and $f_n$ are the frequencies of the values $x_1$, $x_2$, … and $x_n$ in the sample. Given that $N$ is very large it is very likely that the relative frequencies $f_1/N$, $f_2/N$, … and $f_n/N$ will be close to $p_1$, $p_2$, … and $p_n$, respectively, and therefore the average will be close to $\mathrm E_f(\tau)$.

Now, the definition of the key concept of this post:

Definition 3. The expected surprisingness or entropy $\mathrm H_f$ (that’s the Greek capital letter eta, not the Latin capital letter H) of $f$ is $\mathrm E_f(S \circ f)$ (here, $S \circ f$ is the composition of $S$ and $f$, i.e. $x : A \mapsto S(f(x))$).

The use of the word “entropy” here is closely related to the use of the same word in thermodynamics. However, I won’t go into the connection here (I don’t really know enough about thermodynamics to be able to talk intelligently about it).

Having the concept of entropy to hand gives us a fun new question we can ask about any given probability distribution: what is its entropy, or, more evocatively, how surprising is it?

By (2), we have

$\displaystyle (3) \quad \mathrm H_f = \sum_{x \in A} f(x) S(f(x)).$

Using the definition of $S$, this can also be written as

$\displaystyle (4) \quad \mathrm H_f = -\sum_{x \in A} f(x) \log_b f(x).$

Using logarithm properties, it can even be written as

$\displaystyle \quad \mathrm H_f = \log_b \frac 1 {\prod_{x \in A} f(x)^{f(x)}}.$

This shows that $\mathrm H_f$ is the logarithm of the reciprocal of

$(5) \quad \displaystyle \prod_{x \in A} f(x)^{f(x)},$

which is the geometric weighted average of $p_1$, $p_2$, … and $p_n$, with the weights being identical to the values averaged. A geometric weighted average is similar to an ordinary weighted average, except that the values averaged are multiplied rather than added and the weights are exponents rather than factors. The product (5) can therefore be thought of as the “expected probability” of the value generated by $f$. To the extent that $f$ is likely to generate one of the members of $A$ which has a low individual probability of being generated, the product (5) is small and accordingly $\mathrm H_f$ is large. This may help give you a slightly more concrete idea why $\mathrm H_f$ can be thought of as the expected surprisingness of $f$.

Note that the value of $\mathrm H_f$ is determined completely by the probabilities $p_1$, $p_2$, … and $p_n$. The values $x_1$, $x_2$, … and $x_n$ are irrelevant in and of themselves. This is evident from the formula (3), in which the expression $x$ only appears as a sub-expression of $f(x)$. To understand how $\mathrm H_f$ is determined by these probabilities, it helps to look at the graph of the mapping $p : (0, 1] \mapsto p S(p)$, which I have plotted below. The entropy $\mathrm H_f$ is a sum of exactly $n$ values of this mapping.

It can be seen from the graph that the mapping has a maximum value. Using calculus, we can figure out what this maximum value is and exactly where it is attained.

Theorem 1. The maximum value attained by $p : [0, 1] \mapsto p S(p)$ is $1/b$, and this maximum is attained at the point $1/b$ (and nowhere else).

Proof. The derivative of $p : (0, +\infty) \mapsto p (-\log_b p)$ is $p : (0, +\infty) \mapsto -(1 + \log_b p)$ (by the product rule), which is positive if $p < 1/b$ (because then $\log_b p < -1$), equal to 0 if $p = 1/b$ (because then $\log_b p = -1$), and negative if $p > 1/b$ (because then $\log_b p > -1$). Therefore $p : (0, \infty) \mapsto p (-\log_b p)$ increases in value from the vicinity of 0 to $1/b$ and decreases in value from $1/b$ towards $+\infty$, which means it attains a minimum at $1/b$. $\blacksquare$

Because $\mathrm H_f$ is a sum of $n$ values of $p : [0, 1] \mapsto p S(p)$, we may conclude that the inequality

$\displaystyle \mathrm H_f \le \frac n b$

always holds. However, this upper bound on the value of $H_f$ can be improved, as we’ll see below. After all, in proving that it holds we’ve made no use of equation (1).

### Cross entropy

The concept of cross entropy is a useful generalization of the concept of entropy.

Suppose $g$ another probability distribution on $A$. Think of $f$ and $g$ as two different models of the random member-of-$A$ generator: $f$ is the right model (or a more right model, if you don’t like the idea of one single right model), and $g$ is an observer’s working model. The probabilities $p_1$, $p_2$, … and $p_n$ can be thought of as the real probabilities of generation of $x_1$, $x_2$, … and $x_n$, respectively, while the probabilities $q_1 = g(x_1)$, $q_2 = g(x_2)$, … and $q_n = g(x_n)$ can be thought of as the observer’s subjective probabilities. Although the real probabilities determine what the observer observes, the subjective probabilities are what determine how surprised the observer is by what they observe. Therefore, if the observer calculates entropy by averaging their surprisingness over a large number of observations, they will get something close to

$\displaystyle \sum_{k = 1}^n p_k S(q_k),$

i.e. $\mathrm E_f(S \circ g)$. This quantity is called the cross entropy from $g$ to $f$ and denoted $\mathrm H(f, g)$. Note that if $f = g$ then $\mathrm H(f, g) = \mathrm H(f) = \mathrm H(g)$; that’s why the concept of cross entropy is a generalization of the concept of entropy. Note also that $\mathrm H(f, g)$ is not necessarily the same as $\mathrm H(g, f)$.

Your intuition should tell you that $\mathrm H(f, g)$ will always be greater than $\mathrm H(f)$, if $f \ne g$. Why? Think about it: $\mathrm H(f, g)$ is the expected surprisingness if the observer has the wrong model, $\mathrm H(f, f)$ is the expected surprisingness if the observer has the right model. Having the right model should lead to the observer being better able to predict which member of $A$ is generated and thus to the observer being less likely to be surprised.

It is indeed the case that

$\displaystyle (6) \quad \mathrm H(f, g) > \mathrm H(f)$

if $\mathrm f \ne g$ (which further reassures us that $S$ is a very good mathematical model of the intuitive notion of surprisingness). The inequality (6) is called Gibbs’ inequality. In order to prove it, first, observe that it may be rewritten as

$\displaystyle (7) \quad \mathrm H(f, g) - \mathrm H(f) > 0.$

Now, the quantity on the left-hand side of (7) has a name of its own: it’s called the Kullback-Leibler divergence from $g$ to $f$ and denoted $\mathrm D(f, g)$. It measures the “penalty” in increased surprisingness the observer gets for having the wrong model; it can also be thought of as a measure of how different the probability distributions $f$ and $g$ are from each other, hence the name “divergence”. As for the “Kullback-Leibler” part, that’s just because mathematicians have come up with lots of different ways of measuring how different two probability distributions are from each other and Kullback and Leibler were the mathematicians who came up with this particular measure. I won’t be referring to any other such measures in this post, however, so whenever I need to refer to Kullback-Leibler divergence again I’ll just refer to it as “divergence”.

So Gibb’s inequality, reformulated, states that the divergence between two unequal probability distributions is always positive. To prove this, it’s helpful to first write out an explicit expression for $\mathrm D(f, g)$:

\displaystyle \begin{aligned} \mathrm D(f, g) &= \mathrm H(f, g) - \mathrm H(f) \\ &= \sum_{x \in A} f(x) S(g(x)) - \sum_{x \in A} f(x) S(f(x)) \\ &= \sum_{x \in A} f(x) (S(g(x)) - S(f(x)) \\ &= \sum_{x \in A} f(x) (\log_b f(x) - \log_b g(x)) \\ &= \sum_{x \in A} f(x) \log_b \frac {f(x)} {g(x)}. \end{aligned}

Second, we prove a lemma.

Lemma 1. For every positive $x \in \mathbb R$, we have $\log_b x \ge (x - 1)/(x \log b)$, where $\log b$ is the natural logarithm (logarithm to base $e$) of $b$, with equality if and only if $x = 1$.

Proof. The inequality in question is equivalent to $\log x \ge (x - 1)/x$ (by multiplication of both sides by $\log b$). The right-hand side of this inequality is equal to $1 - 1/x$. Consider the mapping $x : (0, +\infty) \mapsto \log x - (1 - 1/x)$. The derivative of this mapping is $x : (0, +\infty) \mapsto 1/x - 1/x^2$, which is negative if $x < 1$ (because then $x^2 < x$ and therefore $1/x < 1/x^2$), equal to 0 if $x = 1$ (because then $x^2 = x$) and positive if $x > 1$ (because then $x^2 > x$ and therefore $1/x > 1/x^2$). Therefore $x : (0, +\infty) \mapsto \log x - (1 - 1/x)$ is strictly decreasing on $(0, 1)$ and strictly increasing on $(1, +\infty)$. It follows that its minimum value is attained at the point $1$. And that minimum value is $\log x - (1 - 1/x) = 0 - (1 - 1) = 0 - 0 = 0$. $\blacksquare$

Using Lemma 1, we have

$\displaystyle (8) \quad \log_b \frac {f(x)} {g(x)} \ge \frac {f(x)/g(x) - 1} {(f(x)/g(x)) \log b} = \frac {f(x) - g(x)} {f(x) \log b}$

for every $x \in A$, with equality if and only if $f(x)/g(x) = 1$, i.e. $f(x) = g(x)$. Given that $f \ne g$, there is at least one $x \in A$ such that $f(x) \ne g(x)$ and therefore (8) holds without equality. It follows that

\begin{aligned} \mathrm D(f, g) &> \sum_{x \in A} f(x) \frac {f(x) - g(x)} {f(x) \log b} \\ &= \frac 1 {\log b} \sum_{x \in A} (f(x) - g(x)) \\ &= \frac 1 {\log b} \left( \sum_{x \in A} f(x) - \sum_{x \in A} g(x) \right) \\ &= \frac {1 - 1} {\log b} \\ &= 0, \end{aligned}

which proves Gibbs’ inequality.

Gibbs’ inequality is quite powerful and useful. For example, it can be used to figure out what the maximum possible entropy is. Suppose $g$ is such that $\mathrm H(f, g) = \mathrm H(g)$, regardless of the value of $f$. Then if $f \ne g$, we have $\mathrm H(g) > \mathrm H(f)$ by Gibbs’ inequality, and therefore $\mathrm H(g)$ is the maximum entropy possible for probability distributions on $A$ and $g$ is the unique probability distribution on $A$ whose entropy is $\mathrm H(g)$. Is there a probability distribution $g$ on $A$ such that $\mathrm H(f, g) = \mathrm H(g)$, regardless of the value of $f$? There is indeed. Let $g$ be the uniform distribution on $A$, i.e. the mapping $x : A \mapsto 1/n$ (remember, $n$ is the number of members $A$ has). Then

$\displaystyle \sum_{x \in A} g(x) = \frac n n = 1$

so $g$ is a probability distribution, and regardless of the value of $f$ we have

\begin{aligned} \mathrm H(f, g) &= \sum_{x \in A} f(x) \left(-\log_b \frac 1 n \right) \\ &= \left( \sum_{x \in A} f(x) \right) \log_b n \\ &= \log_b n. \end{aligned}

Therefore, the maximum entropy possible on $A$ is $\log_b n$, and the uniform distribution on $A$ is the probability distribution which attains this maximum.

A consequence of this is that the divergence of $f$ from the uniform distribution on $A$ is given by

$\displaystyle \mathrm D(f, g) = \mathrm H(f, g) - \mathrm H(f) = \log_b - \mathrm H(f)$

which is just the negation of $\mathrm H(f)$ plus a constant $\log_b n$ (well, a constant depending on the size of the set $A$ which $f$ is distributed over). Therefore, among probability distributions on $A$ specifically, entropy can be thought of as a measure of divergence from the uniform distribution. Among probability distributions in general, entropy is a measure of both divergence from the uniform distribution and the size of the distributed-over set.

### Philosophical implications

So far, we’ve seen that the informal concept of surprisingness can be formalized mathematically with quite a high degree of success—one might even say a surprising degree of success, which is fitting—and that’s pretty neat. But there are also some deeper philosophical issues related to all this. I’ve avoided talking about them up till now because philosophy is not really my field, but I couldn’t let them go completely unmentioned.

Suppose that you know that one of $n$ values $x_1$, $_2$, … and $x_n$ values (where $n$ is a positive integer) will be generated by some probability distribution $f$, but you don’t know the probabilities; you have absolutely no information about the probability distribution other than the set of values it may generate. What should your subjective probability distribution be? A possible answer to this question is that you shouldn’t have one—you simply don’t know the probabilities, and that’s all that can be said. And that’s reasonable enough, especially if you don’t like the concept of subjective probability or think it’s incoherent (e.g. if you’re a strict frequentist when it comes to interpreting probability). But if you accept that subjective probabilities are things it makes sense to talk about, well, the whole point of subjective probabilities is that they represent states of knowledge under uncertainty, so there’s no reason in principle to avoid having a subjective probability distribution just because this particular state of knowledge is particularly uncertain. Of course this subjective probability distribution may change as you gather more information—think of it as your “best guess” to start with, to be improved later on.

There are some people who’d say that you can choose your initial “best guess” on an essentially arbitrary basis, according to your own personal whims. No matter which subjective probability distribution you choose to start with, it will get better and better at modelling reality as you change it as you gather more information; the initial state isn’t really important. The initial state is of interest only in that if we have some updating mechanism in mind, we’d like to be able to prove that convergence to the real probability distribution will always happen independently of the initial state.

There is another position which can be taken, however, which is that there is in fact a certain objectively best subjective probability distribution to start with. This position is associated with the marquis de Laplace (1749–1827), who wrote a classic text on probability theory, the Théorie analytique des probabilités (Laplace was also a great contributor to many other fields of mathematics, as mathematicians back then tended to be—they didn’t do specialization back then). In Laplace’s opinion, the correct distribution to start with was the uniform distribution. That is, given that we know nothing about the probabilities, assume they are all the same (after all, we have no reason to believe any one is larger than any other). This principle is called the principle of indifference.

The concept of probability as a whole can be based on the idea of the principle of indifference. The idea would be that on some level, any probability distribution is over a set of interchangeable values and therefore uniform. However, we are often interested only in whether one of a particular class of values is generated (not in which particular member of that class) and the probability of interest in that case is the sum of the probabilities of each of the values in that class, which, because the underlying distribution is uniform, can also be expressed as the ratio of the number of values in the class to the total number of values which may be generated. I don’t know how far this is how Laplace thought of probability; I don’t want to be too hasty to attribute views to a 19th-century author which might be out of context or just incorrect (after all, I can’t read French so all my information about Laplace is second-hand).

It’s not hard to argue with the principle of indifference. It’s quite difficult to think of any reasonable justification for it at all. It was indeed attacked in vehement terms by later probability theorists, such as John Venn (1834–1923) (that’s right, the Venn diagram guy). In modern times, however, it has been championed by the statistical physicist E. T. Jaynes (1922–1998), who also came up with an interesting extension of it.

In the mathematical section of this post, we saw that the uniform distribution over $x_1$, $x_2$, … and $x_n$ was the one with the most entropy, i.e. the one which is “most surprising on average”. Therefore, a natural generalization of the principle of indifference would be to say that in any circumstance in which a subjective probability distribution must be assumed in a situation of uncertainty, one should assume whichever distribution has the most entropy while still being compatible with what is known. For example, if it is known what the mean is then the assumed distribution should be the one with the most entropy among all distributions having that specific mean. This is called the principle of maximum entropy or MaxEnt for short.

The MaxEnt principle makes sense intuitively, sort of, in a way that the principle of indifference on its own doesn’t. If you don’t know something about the probability distribution, you should expect to be surprised more often by the values it generates than if you do know that something. It’s still on fairly shaky ground, though, and I don’t know how far it is accepted by people other than Jaynes as a normative principle, as opposed to just one way in which you might choose the subjective probability distribution to start with, in competition with other ways of doing so on the basis of strictly pragmatic considerations (which seems to be how a lot of people in the practical applications side of things view it). In any case it gives us a philosophical motivation for examining the mathematical problem of finding the probability distributions that have the most entropy given particular constraints. The mathematical problem is interesting itself, but the philosophical connection makes it more interesting.

In part 2 of this post, I’m going to describe how the famous normal distribution can be characterized as a maximum entropy distribution: namely, if it’s the normal distribution with mean $\mu$ (a real number) and standard deviation $\sigma$ (a positive real number), then it’s the absolutely probability distribution over $\mathbb R$ with the most entropy among all absolutely continuous probability distributions over $\mathbb R$ with mean $\mu$ and standard deviation $\sigma$. That roughly means that by MaxEnt, if you know nothing about a continuous probability distribution other than that it has a particular mean and a particular standard deviation, your best guess to start with is that it’s normal. You can understand the famous Central Limit Theorem from that perspective: as you add up independent, identically distributed random variables, the mean and variance of the distribution in question will be carried over into the sum (elementary statistical theory tells us that the mean of any sum of random variables is the sum of the individual means, and the variance of any sum of independent random variables is the sum of the individual variances), but every other distinctive property of the distribution is gradually kneaded out by the summing process, so that as the sum gets larger and larger all we can say about the probability distribution of the sum is that it has this particular mean and this particular variance. I intend to finish off part 2 with a proof of the Central Limit theorem from this perspective, although that might be a little ambitious. Before that, though, the other big thing which I need to cover in part 2 is defining entropy in the case of a continuous probability distribution—I’ve only been talking about discrete probability distributions in part 1, and it turns out the extension is not entirely a straightforward matter.

## Bonferroni’s inequalities and the inclusion-exclusion principle

A non-negative real- or ∞-valued mapping f on a field $\mathcal F$ of sets is said to be finitely additive if and only if for every pair of disjoint sets A and B in $\mathcal F$, we have

$\displaystyle f(A \cup B) = f(A) + f(B).$

The most important examples of finitely additive mappings are measures, including probability measures, although not every finitely additive mapping is a measure (measures are mappings on σ-algebras, which are a special sort of field of sets, that are countably additive, which is a stronger property than finite additivity).

From the definition it is immediately evident that finite additivity allows us to express the value under a mapping f on a field $\mathcal F$ of sets of any binary union of pairwise disjoint sets in $\mathcal F$ in terms of the values under f of the individual sets. In fact, the same can be said for unions of any arity, provided they are pairwise disjoint. For every field $\mathcal F$ of sets, every finitely additive mapping f on $\mathcal F$, every $n \in \mathbb Z_0$[1] and every n-tuple (A1, A2, …, An) of pairwise disjoint sets in $\mathcal F$, we have

$\displaystyle (1) \quad f \left( \bigcup_{k = 1}^n A_k \right) = \sum_{k = 1}^n A_k.$

This can be proven by induction. In the base case of n = 0, we have

\displaystyle \begin{aligned} f(\emptyset) &= f(\emptyset \cup \emptyset) \\ &= f(\emptyset) + f(\emptyset) \end{aligned}

which implies that 0 = f(∅). Now, suppose $m \in \mathbb Z_0$ and (1) holds for every (m + 1)-tuple (A1, A2, …, Am + 1) of pairwise disjoint sets in $\mathcal F$. Then

\displaystyle \begin{aligned} f \left( \bigcup_{k = 1}^{m + 1} A_k \right) &= f \left( \left( \bigcup_{k = 1}^m A_k \right) \cup A_{m + 1} \right) \\ &= f \left( \bigcup_{k = 1}^m A_k \right) + f(A_{m + 1}) \\ &= \sum_{k = 1}^m f(A_k) + f(A_{m + 1}) \\ &= \sum_{k = 1}^{m + 1} f(A_k) \end{aligned}

which completes the proof.

But what about unions of sets in $\mathcal F$ that are not necessarily pairwise disjoint? Can the values under f of such unions be expressed in terms of the values under f of the individual sets then? The answer is no. However, such unions’ values under f can be expressed in terms of the values under f of the individual sets and their intersections, by what is known as the inclusion-exclusion principle. For every $n \in \mathbb Z_0$ and every $(A_1, A_2, \dotsc, A_n) \in \mathcal F^{\times n}$, we have

$\displaystyle (2) \quad f \left( \bigcup_{k = 1}^n A_k \right) = \sum_{\substack{S \subseteq \mathbb Z_1^n \\ S \ne \emptyset}} (-1)^{\#S + 1} f \left( \bigcap_{k \in S} A_k \right).$

The sum on the right-hand side of (2) is a rather complicated one, cumbersome to write down as well as computationally expensive to compute by virtue of its large number of terms (one for every non-empty subset of $\mathbb Z_1^n$, and there are 2n − 1 of those). Therefore, it is also convenient to use Bonferroni’s inequalities, which say that for every $m \in \mathbb Z_0$, we have

$\displaystyle (3) \quad f \left( \bigcup_{k = 1}^n A_k \right) \gtreqless \sum_{\substack{S \subseteq \mathbb Z_1^n \\ 0 < \#S \le m}} (-1)^{\#S + 1} f \left( \bigcap_{k \in S} n A_k \right),$

with the sign $\gtreqless$ standing for “greater than or equal to, if m is even; less than or equal to, if m is odd”. The sum on the right-hand side of (3) has terms only for every non-empty subset of $\mathbb Z_1^n$ with no more than m members, and there are only 2m − 1 of those. In particular, if m = 1 the terms are just the values under f of the individual sets A1, A2, … and An and therefore we have

$\displaystyle f \left( \bigcup_{k = 1}^n A_k \right) \le \sum_{k = 1}^n f(A_k),$

which is Boole’s inequality.

Note that Bonferroni’s inequalities hold when mn as well as when m < n. When mn, the sum on the RHS of (3) is exactly the same as the sum on the RHs of (2). Because there are both even and odd integers m such that mn, and any quantity which is both less than or equal to and greater than or equal to another quantity has to be equal to it, it follows that Bonferroni’s inequalities imply that the equation (2) holds and thus generalize the inclusion-exclusion principle.

In order to prove that Bonferroni’s inequalities hold, and thus prove the inclusion-exclusion principle, we can use induction. First, consider the case where m = 0. The only subset of $\mathbb Z_1^n$ of cardinality 0 is the empty one, so in this case the sum on the right-hand side of (3) is empty and (3) therefore reduces to the statement that

$\displaystyle f \left( \bigcup_{k = 1}^n A_k \right) \ge 0,$

which is true because f is non-negative- or ∞-valued.

Now, suppose $M \in \mathbb Z_0$ and Bonferroni’s inequalities hold in the case where m = M, and consider the case where m = M + 1. We shall use another inductive proof, within the inductive proof we’re currently carrying out, to show that Bonferroni’s inequalities hold for every $n \in \mathbb Z_0$ in when m has this particular value. In the case where n = 0, the left-hand side of (3) reduces to f(∅) and the right-hand side is the empty sum once again, because there are no subsets of $\mathbb Z_1^0$ ($\mathbb Z_1^0$ is the set of the integers greater than or equal to 1 and less than or equal to 0, and there are of course no such integers). Because f(∅) = 0 it follows that (3) is true, regardless of the direction of the inequality required.

As for the successor case, suppose that $N \in \mathbb Z_0$ and Bonferroni’s inequalities hold in the case where n = N. Consider the case where n = N + 1. It is helpful at this point to write (3) for the given values of m and n as

$\displaystyle (-1)^{M + 1} (f(A) - a) \ge 0,$

where

\displaystyle \begin{aligned} A &= \bigcup_{k = 1}^{N + 1} A_k, \\ a &= \sum_{\substack{S \subseteq \mathbb Z_1^{N + 1} \\ 0 < \#S \le M + 1}} (-1)^{\#S + 1} f \left( \bigcap_{k \in S} A_k \right). \end{aligned}

If we also let

\displaystyle \begin{aligned} B &= \bigcup_{k = 1}^N A_k, \\ b &= \sum_{\substack{S \subseteq \mathbb Z_1^N \\ 0 < \#S \le M + 1}} (-1)^{\#S + 1} f \left( \bigcap_{k \in S} A_k \right), \end{aligned}

then we have (−1)M + 1(f(B) − b) ≥ 0 because Bonferroni’s inequalities hold in the case where n = N. Now, how are f(A) and a related to f(B) and b?

We obviously have A = BAN + 1, but B and AN + 1 are not necessarily disjoint so this doesn’t immediately tell us anything about the relationship of f(A) and f(B). However, AN + 1 is certainly disjoint from B \ AN + 1, so we have

$(4) \quad \displaystyle f(A) = f(B \setminus A_{N + 1}) + f(A_{N + 1}).$

That’s about all we can usefully say for the moment. As for a and b, well, the terms of b are a subset of those of a so it’s quite easy to write down the difference ab. If we manipulate that difference a little bit, we can start getting it to look like something that could occur on the right-hand side of (3).

\displaystyle \begin{aligned} a - b &= \sum_{\substack{S \subseteq \mathbb Z_1^{N + 1} \\ N + 1 \in S \\ \#S \le M + 1}} (-1)^{\#S + 1} f \left( \bigcap_{k \in S} A_k \right) \\ &= \sum_{\substack{S \subseteq \mathbb Z_1^N \\ \#S \le M}} (-1)^{\#S + 2} f \left( \left( \bigcap_{k \in S} A_k \right) \cap A_{N + 1} \right) \\ &= f(A_{N + 1}) - \sum_{\substack{S \subseteq \mathbb Z_1^N \\ 0 < \#S \le M}} (-1)^{\#S + 1} f \left( \bigcap_{k \in S} A_k \cap A_{N + 1} \right) \end{aligned}

Let

\displaystyle \begin{aligned} C &= \bigcup_{k = 1}^N A_k \cap A_{N + 1}, \\ c &= \sum_{\substack{S \subseteq \mathbb Z_1^N \\ 0 < \#S \le M}} (-1)^{\#S + 1} f \left( \bigcap_{k \in S} A_k \cap A_{N + 1} \right). \end{aligned}

Then we have (−1)M(f(C) − c) ≥ 0, because Bonferroni’s inequalities hold in the case where m = M, and we have ab = f(AN + 1) − c.

Now, if we use the distributivity of intersection over union we can rewrite C as BAN + 1. It follows that B is the disjoint union of C and the set B \ AN + 1 which turned up above when we were contemplating the relationship of f(A) and f(B), and therefore we have f(B) = f(C) + f(B \ AN + 1). Using this new equation we may rewrite (4) as

$(4) \quad \displaystyle f(A) = f(B) - f(C) + f(A_{N + 1}),$

from which it follows that f(A) − f(B) = f(AN + 1) − f(C)—a nicely analogous equation to ab = f(AN + 1) − c. Finally, let us add the two quantities (−1)M + 1(f(B) − b) and (−1)M(f(C) − c) ≥ 0 which we know to be non-negative. The sum of two non-negative quantities is non-negative also, so we have

\displaystyle \begin{aligned} 0 &\le (-1)^{M + 1} (f(B) - b) + (-1)^M (f(C) - c) \\ &= (-1)^{M + 1} (f(B) - b - (f(C) - c)) \\ &= (-1)^{M + 1} (f(B) - f(C) - (b - c)) \\ &= (-1)^{M + 1} (f(A) - f(A_{N + 1}) - (a - f(A_{N + 1}))) \\ &= (-1)^{M + 1} (f(A) - a). \end{aligned}

$\blacksquare$

[1] For every $(a, b) \in \mathbb Z^{\times 2}$, I denote $\{n \in \mathbb Z : a \le n \le b\}$, $\{n \in \mathbb Z : a \le n\}$ and $\{n \in \mathbb Z : n \le b\}$ by $\mathbb Z_a^b$, $\mathbb Z_a$ and $\mathbb Z^b$ respectively.

## A very simple stochastic model of diachronic change

1. The discrete process

1.1. The problem

Consider an entity (for example, a language) which may or may not have a particular property (for example, obligatory coding of grammatical number). For convenience and interpretation-neutrality, we shall say that the entity is positive if it has this property and negative if it does not have this property. Consider the entity as it changes over the course of a number of events (for example, transmissions of the language from one generation to another) in which the entity’s state (whether it is positive or negative) may or may not change. For every nonnegative integer ${n}$, let ${X_n}$ represent the entity’s state after exactly ${n}$ events have occurred, with negativity being represented by 0 and positivity being represented by 1. The initial state ${X_0}$ is a constant parameter of the model, but the states at other times are random variable whose “success” probabilities (i.e. values of 1 under their probability mass functions) are determined by ${X_0}$ and the other parameters of the model.

The other parameters of the model, besides ${X_0}$, are denoted by ${p}$ and ${q}$. These represent the probabilities that an event will change the state from negative to positive or from positive to negative, respectively. They are assumed to be constant across events—this assumption can be thought of as an interpretation of the uniformitarian principle familiar from historical linguistics and other fields. I shall call a change of state from negative to positive a gain and a change of state from positive to negative a loss, so that ${p}$ can be thought of as the gain rate per event and ${q}$ can be thought of as the loss rate per event.

Note that the gain resp. loss probability is ${p}$/${q}$ only if the state is negative resp. positive as the event begins. If the state is already positive resp. negative as the event begins then it is impossible for a further gain resp. loss to occur and therefore the gain resp. loss probability is 0 (but the loss resp. gain probability is ${q}$/${p}$). Thus the random variables ${X_1}$, ${X_2}$, ${X_3}$, … are not necessarily independent of one another.

I am aware that there’s a name for a sequence of random variables that are not necessarily independent of one another, namely “stochastic process”. However, that is about the extent of what I know about stochastic processes. I think the thing I’m talking about in this post is a very simple example of a stochastic process–an appropriate name for it would be the gain-loss process. If you know something about stochastic processes it might seem very trivial, but it was an interesting problem for me to try to figure out knowing nothing already about stochastic processes.

1.2. The solution

Suppose ${n}$ is a nonnegative integer and consider the state ${X_{n + 1}}$ after exactly ${n + 1}$ events have occurred. If the entity is negative as the ${(n + 1)}$th event begins, the probability of gain during the ${(n + 1)}$th event is ${p}$. If the entity is positive as the ${(n + 1)}$th event begins, the probability of loss during the ${(n + 1)}$th event is ${q}$. Now, as the ${(n + 1)}$th event begins, exactly ${n}$ events have already occurred. Therefore the probability that the entity is negative as the ${(n + 1)}$th event begins is ${\mathrm P(X_n = 0)}$ and the probability that the entity is positive as the ${(n + 1)}$th event begins is ${\mathrm P(X_n = 1)}$. It follows by the law of total probability that

\displaystyle \begin{aligned} \mathrm P(X_{n + 1} = 1) &= p (1 - \mathrm P(X_n = 1)) + (1 - q) \mathrm P(X_n = 1) \\ &= p - p \mathrm P(X_n = 1) + \mathrm P(X_n = 1) - q \mathrm P(X_n = 1) \\ &= p - (p - 1 + q) \mathrm P(X_n = 1) \\ &= p + (1 - p - q) \mathrm P(X_n = 1). \end{aligned}

This recurrence relation can be solved using the highly sophisticated method of “use it to find general equations for the first few terms in the sequence, extrapolate the pattern, and confirm that the extrapolation is valid using a proof by induction”. I’ll spare you the laborious first phrase, and just show you the second and third. The solution is

\displaystyle \begin{aligned} \mathrm P(X_n = 1 | X_0 = 0) &= p \sum_{i = 0}^{n - 1} (1 - p - q)^i, \\ \mathrm P(X_n = 1 | X_0 = 1) &= 1 - q \sum_{i = 0}^{n - 1} (1 - p - q)^i. \end{aligned}

Just so you can check that this is correct, the proofs by induction for the separate cases are given below.

Case 1 (${X_0 = 0)}$. Base case. The expression

$\displaystyle p \sum_{i = 0}^{n - 1} (1 - p - q)^i$

evaluates to 0 if ${n = 0}$, because the sum is empty.

Successor case. For every nonnegative integer ${n}$ such that

$\displaystyle \mathrm P(X_n = 1 | X_0 = 0) = p \sum_{i = 0}^{n - 1} (1 - p - q)^i,$

we have

\displaystyle \begin{aligned} \mathrm P(X_{n + 1} = 1 | X_0 = 0) &= p + (1 - p + q) \mathrm P(X_n = 1 | X_0 = 0) \\ &= p + (1 - p - q) p \sum_{i = 0}^{n - 1} (1 - p - q)^i \\ &= p + p (1 - p - q) \sum_{i = 0}^{n - 1} (1 - p - q)^i \\ &= p \left( 1 + \sum_{i = 0}^{n - 1} (1 - p - q)^{i + 1} \right) \\ &= p \sum_{j = 0}^n (1 - p - q)^j. \end{aligned}

Case 2 (${X_0 = 1}$). Base case. The expression

$\displaystyle 1 - q \sum_{i = 0}^{n - 1} (1 - p - q)^i$

evaluates to 1 if ${n = 0}$, because the sum is empty.

Successor case. For every nonnegative integer ${n}$ such that

$\displaystyle \mathrm P(X_n = 1 | X_0 = 1) = 1 - q \sum_{i = 0}^{n - 1} (1 - p - q)^i,$

we have

\displaystyle \begin{aligned} \mathrm P(X_{n + 1} = 1 | X_0 = 1) &= p + (1 - p + q) \mathrm P(X_n = 1 | X_0 = 1) \\ &= p + (1 - p - q) \left( 1 - q \sum_{i = 0}^{n - 1} (1 - p - q)^i \right) \\ &= p + 1 - p - q - (1 - p - q) q \sum_{i = 0}^{n - 1} (1 - p - q)^i \\ &= 1 - q - q (1 - p - q) \sum_{i = 0}^{n - 1} (1 - p - q)^i \\ &= 1 - q \left( 1 + \sum_{i = 0}^{n - 1} (1 - p - q)^{i + 1} \right) \\ &= 1 - q \sum_{j = 0}^n (1 - p - q)^j. \end{aligned}

I don’t know if there is any way to make sense of why exactly these equations are the way they are; if you have any ideas, I’d be interested to hear your comments. There is a nice way I can see of understanding the difference between the two cases. Consider an additional gain-loss process ${B}$ which changes in tandem with the gain-loss process ${A}$ that we’ve been considering up till just now, so that its state is always the opposite of that of ${A}$. Then the gain rate of ${B}$ is ${q}$ (because if ${B}$ gains, ${A}$ loses) and the lose rate of ${B}$ is ${p}$ (because if ${B}$ loses, ${A}$ gains). And for every nonnegative integer ${n}$, if we let ${Y_n}$ denote the state of ${B}$ after exactly ${n}$ events have occurred, then

$\displaystyle \mathrm P(Y_n = 1) = 1 - \mathrm P(X_n = 1)$

because ${Y_n = 1}$ if and only if ${X_n = 0}$. Of course, we can also rearrange this equation as ${\mathrm P(X_n = 1) = 1 - \mathrm P(Y_n = 1)}$.

Now, we can use the equation for Case 1 above, but with the appropriate variable names for ${B}$ substituted in, to see that

$\displaystyle \mathrm P(Y_n = 1 | Y_0 = 0) = q \sum_{i = 0}^{n - 1} (1 - q - p)^i,$

and it then follows that

$\displaystyle \mathrm P(X_n = 1 | X_0 = 1) = 1 - q \sum_{i = 0}^{n - 1} (1 - p - q)^i.$

Anyway, you may have noticed that the sum

$\displaystyle \sum_{i = 0}^{n - 1} (1 - p - q)^i$

which appears in both of the equations for ${\mathrm P(X_n = 1)}$ is a geometric progression whose common ratio is ${1 - p - q}$. If ${1 - p - q = 1}$, then ${p + q = 0}$ and therefore ${p = q = 0}$ (because ${p}$ and ${q}$ are probabilities, and therefore non-negative). The probability ${\mathrm P(X_n = 1)}$ is then simply constant at 0 if ${X_0 = 0}$ (because gain is impossible) and constant at 1 if ${X_0 = 1}$ (because loss is impossible). Outside of this very trivial case, we have ${1 - p - q \ne 1}$, and therefore the geometric progression may be written as a fraction as per the well-known formula:

\displaystyle \begin{aligned} \sum_{i = 0}^{n - 1} (1 - p - q)^i &= \frac {1 - (1 - p - q)^n} {1 - (1 - p - q)} \\ &= \frac {1 - (1 - p - q)^n} {p + q}. \end{aligned}

It follows that

\displaystyle \begin{aligned} \mathrm P(X_n = 1 | X_0 = 0) &= \frac {p (1 - (1 - p - q)^n)} {p + q}, \\ \mathrm P(X_n = 1 | X_0 = 1) &= 1 - \frac {q (1 - (1 - p - q)^n)} {p + q} \\ &= \frac {p + q - q (1 - (1 - p - q)^n)} {p + q} \\ &= \frac {p + q - q + q (1 - p - q)^n} {p + q} \\ &= \frac {p + q (1 - p - q)^n} {p + q}. \end{aligned}

From these equations it is easy to see the limiting behaviour of the gain-loss process as the number of events approaches ${\infty}$. If ${1 - p - q = -1}$, then ${p + q = 2}$ and therefore ${p = q = 1}$ (because ${p}$ and ${q}$ are probabilities, and therefore not greater than 1). The equations in this case reduce to

\displaystyle \begin{aligned} \mathrm P(X_n = 1 | X_0 = 0) &= \frac {1 - (-1)^n} 2, \\ \mathrm P(X_n = 1 | X_0 = 1) &= \frac {1 + (-1)^n} 2, \end{aligned}

which show that the state simply alternates deterministically back and forth between positive and negative (because ${(1 - (-1)^n)/2}$ is 0 if ${n}$ is even and 1 if ${n}$ is odd and ${(1 + (-1)^n)/2}$ is 1 if ${n}$ is even and 0 if ${n}$ is odd).

Otherwise, we have ${|1 - p - q| < 1}$ and therefore

$\displaystyle \lim_{n \rightarrow \infty} (1 - p - q)^n = 0.$

Now the equations for ${\mathrm P(X_n = 1 | X_0 = 0)}$ and ${\mathrm P(X_n = 1 | X_0 = 1)}$ above are the same apart from the term in the numerator which contains ${(1 - p - q)^n}$ as a factor, as well as another factor which is independent of ${n}$. Therefore, regardless of the value of ${X_0}$,

$\displaystyle \lim_{k \rightarrow \infty} \mathrm P(X_k = 1) = \frac p {p + q}.$

This is a nice result: if ${n}$ is sufficiently large, the dependence of ${X_n}$ on ${X_0}$, ${X_1}$, … and ${X_{n - 1}}$ is negligible and its success probability is negligibly different from ${p/(p + q)}$. That it is this exact quantity sort of makes sense: it’s the ratio of the gain rate to the theoretical rate of change of state in either direction that we would get if both a gain and loss could occur in a single event.

In case you like graphs, here’s a graph of the process with ${X_0 = 0}$, ${p = 1/100}$, ${q = 1/50}$ and 500 events. The x-axis is the number of events that have occurred and the y-axis is the observed frequency, divided by 1000, of the state being positive after this number of events has occurred (for the blue line) or the probability of the state being positive according to the equations described in this post (for the green line). If you want to, you can view the Python code that I used to generate this graph (which is actually capable of simulating multiple-trait interactions, although I haven’t tried solving it in that case) on GitHub.

2. The continuous process

2.1. The problem

Let us now consider the same process, but continuous rather than discrete. That is, rather than the gains and losses occuring over the course of a discrete sequence of events, we now have a continuous interval in time, during which at any point losses and gains might occur instantaneously. The state of the process at time ${t}$ shall be denoted ${X(t)}$. Although multiple gains and losses may occur during an arbitrary subinterval, we may assume for the purpose of approximation that during sufficiently short subintervals only one gain or loss, or none, may occur, and the probabilities of gain and loss are directly proportional to the length of the subinterval. Let ${\lambda}$ be the constant of proportionality for gain and let ${\mu}$ be the constant of proportionality for loss. These are the continuous model’s analogues of the ${p}$ and ${q}$ parameters in the discrete model. Note that they may be greater than 1, unlike ${p}$ and ${q}$.

2.2. The solution

Suppose ${t}$ is a non-negative real number and ${n}$ is a positive integer. Let ${\Delta t = 1/n}$. The interval in time from time 0 to time ${t}$ can be divided up into ${n}$ subintervals of length ${\Delta t}$. If ${\Delta t}$ is small enough, so that the approximating assumptions described in the previous paragraph can be made, then the subintervals can be regarded as discrete events, during each of which gain occurs with probability ${\lambda \Delta t}$ if the state at the start point of the subinterval is negative and loss occurs with probability ${\mu \Delta t}$ if the state at the start point of the subinterval is positive. For every positive integer ${k}$ between 0 and ${n}$ inclusive, let ${Y_k}$ denote the state of this discrete approximation of the process at time ${t + k \Delta t}$. Then for every integer ${k}$ between 0 and ${n}$ (inclusive) we have

\displaystyle \begin{aligned} \mathrm P(Y_k = 1 | Y_0 = 0) &= \frac {\lambda \Delta t (1 - (1 - \lambda \Delta t - \mu \Delta t)^k)} {\lambda \Delta t + \mu \Delta t}, \\ \mathrm P(Y_k = 1 | Y_0 = 1) &= \frac {\lambda \Delta t + \mu \Delta t (1 - \lambda \Delta t - \mu \Delta t)^k} {\lambda \Delta t + \mu \Delta t}, \end{aligned}

provided ${\lambda}$ and ${\mu}$ are not both equal to 0 (in which case, just as in the discrete case, the state remains constant at whatever the initial state was).

Many of the ${\Delta t}$ factors in this equation can be cancelled out, giving us

\displaystyle \begin{aligned} \mathrm P(Y_k = 1 | Y_0 = 0) &= \frac {\lambda (1 - (1 - (\lambda + \mu) \Delta t)^k)} {\lambda + \mu}, \\ \mathrm P(Y_k = 1 | Y_0 = 1) &= \frac {\lambda + \mu (1 - (\lambda + \mu) \Delta t)^k} {\lambda + \mu}. \end{aligned}

Now consider the case where ${k = n}$ in the limit ${n}$ approaches ${\infty}$. Note that ${\Delta t}$ approaches 0 at the same time, because ${\Delta t = t/n}$, and therefore the limit of ${(1 - (\lambda + \mu) \Delta t)^n}$ is not simply 0 as in the discrete case. If we rewrite the expression as

$\displaystyle \left( 1 - \frac {t (\lambda + \mu)} n \right)^n$

and make the substitution ${n = -mt(\lambda + \mu)}$, giving us

$\displaystyle \left( 1 + \frac 1 m \right)^{-mt(\lambda + \mu)} = \left( \left( 1 + \frac 1 m \right)^m \right)^{-t(\lambda + \mu)},$

then we see that the limit is in fact ${e^{-t(\lambda + \mu)}}$, an exponential function of ${t}$. It follows that

\displaystyle \begin{aligned} \mathrm P(X(t) = 1 | X(0) = 0) = \lim_{n \rightarrow \infty} \mathrm P(Y_n = 1 | Y_0 = 0) &= \frac {\lambda (1 - e^{-t(\lambda + \mu)})} {\lambda + \mu}, \\ \mathrm P(X(t) = 1 | X(0) = 1) = \lim_{n \rightarrow \infty} \mathrm P(Y_n = 1 | Y_0 = 1) &= \frac {\lambda + \mu e^{-t(\lambda + \mu)}} {\lambda + \mu}. \end{aligned}

This is a pretty interesting result. I initially thought that the continuous process would just have the solution ${\mathrm P(X_n = 1) = \lambda/{\lambda + \mu}}$, completely independent of ${X_0}$ and ${t}$, based on the idea that it could be viewed as a discrete process with an infinitely large number of events within every interval of time, so that it would constantly behave like the discrete process does in the limit as the number of events approaches infinity. In fact it turns out that it still behaves like the discrete process, with the effect of the initial state never quite disappearing—although it does of course disappear in the limit as ${t}$ approaches ${\infty}$, because ${e^{-t(\lambda + \mu)}}$ approaches 0:

$\displaystyle \lim_{t \rightarrow \infty} \mathrm P(X(t) = 1) = \frac {\lambda} {\lambda + \mu}.$

## Dirichlet’s approximation theorem

The definition of rational numbers is usually expressed as follows.

Definition 1 For every real number ${x}$, ${x}$ is rational if and only if there are integers ${p}$ and ${q}$ such that ${q \ne 0}$ and ${x = p/q}$.

Remark 1 For every pair of integers ${p}$ and ${q}$ such that ${q \ne 0}$, ${p/q = (p'/\gcd(p, q))/(|q|/\gcd(p, q))}$, where ${p' = p}$ if ${q > 0}$ and ${p' = -p}$ if ${q < 0}$. Therefore, the definition which is the same as Definition 1 except that ${q}$ is required to be positive and ${p}$ is required to be coprime to ${q}$ is equivalent to Definition 1.

However, there’s a slightly different way one can express the definition, which uses the fact that the equations ${x = p/q}$ and ${q x = p}$ are equivalent.

Definition 2 For every real number ${x}$, ${x}$ is rational if and only if there is a nonzero integer ${q}$ such that ${q x}$ is an integer.

Remark 2 The definition which is the same as Definition 3 except that ${q}$ is required to be positive and ${q x}$ is required to be coprime to ${q}$ is equivalent to Definition 3.

The nice thing about Definition 3 is that it immediately brings to mind the following algorithm for verifying that a real number ${x}$ is rational: iterate through the positive integers in ascending order, and for each positive integer ${q}$ check whether ${q x}$ is an integer. (It’s assumed that it is easy to check whether an arbitrary real number is an integer.) If it is an integer, stop the iteration. The algorithm terminates if and only if ${x}$ is rational. The algorithm is obviously not very useful if it is actually used by a computer to check for rationality—one obvious problem is that it cannot verify irrationality, it can only falsify it. But it is useful as a guuide to thought. Mathematical questions are often easier to think about if they are understood in terms of processes, rather than in terms of relationships between static objects.

In particular, there’s a natural way in which some irrational numbers can be said to be “closer to rational” than others, in terms of this algorithm. If ${x}$ is irrational, then none of the terms in the sequence ${\langle x, 2 x, 3 x, \dotsc \rangle}$ are integers. But how close to integers are the terms? The closer they are to integers, the closer to rational ${x}$ can be said to be.

But how is the closeness of the integers to the terms of the sequence to be measured? There are different ways this can be done. Perhaps the most natural way to start off with is to measure it by the minimum of the distances of the terms from the closest integers to them—that is, the minimum of the set ${\{|q x - p|: p \in \mathbf Z, q \in \mathbf N\}}$. Of course, this minimum may not even exist—it may be possible to make ${|q x - p|}$ arbitrarily small by choosing appropriate integers ${p}$ and ${q}$ such that ${q > 0}$. So the first question to answer is this: for which values of ${x}$ does the minimum exist?

The answer to this question is given by Dirichlet’s approximation theorem.

Theorem 3 (Dirichlet’s approximation theorem) For every real number ${x}$ and every positive integer ${n}$, there are integers ${p}$ and ${q}$ such that ${q > 0}$ and

$\displaystyle |q x - p| < \frac 1 n. \ \ \ \ \ (1)$

Proof: First, let us define some notation. For every real number ${x}$, let ${[x]}$ denote the greatest integer less than or equal to ${x}$ and let ${\{x\}}$ denote ${x - [x]}$. Note that the inequality ${0 \le \{x\} < 1}$ always holds.

Now, suppose ${x}$ is a real number and ${n}$ is a positive integer. The ${n + 1}$ real numbers 0, ${\{x\}}$, ${\{2 x\}}$, … and ${\{n x\}}$ are all in the half-open interval ${I = [0, 1)}$. This interval can be partitioned into the ${n}$ sub-intervals ${I_1 = [0, 1/n}$, ${I_2 = [1/n, 2/n)}$, \dots and ${I_n = [1 - 1/n, 1)}$, each of length ${1/n}$. These ${n + 1}$ real numbers are distributed among these ${n}$ sub-intervals, and since there are more real numbers than sub-intervals at least one of the sub-intervals contains more than one of the real numbers. That is, there are integers ${r}$ and ${s}$ such that ${\{r x\}}$ and ${\{s x\}}$ are in the same sub-interval and hence ${|\{r x\} - \{s x\}| < 1/n}$. Or, equivalently:

$\begin{array}{rcl} 1/n &>& \|\{r x\} - \{s x\}| \\ &=& |(r x - [r x]) - (s x - [s x])| \\ &=& |(r - s) x - ([r x] - [s x])|, \end{array}$

so if we let ${q = r - s}$ and ${p = [r x] - [s x]}$ we have ${|q x - p| < 1/n}$. And ${r}$ and ${s}$ can be chosen so that ${r < s}$ and hence ${q}$ is positive. $\Box$

Dirichlet’s approximation theorem says that for every real number ${x}$, ${q x - p}$ can be made arbitrarily small by choosing appropriate integers ${p}$ and ${q}$ such that ${q > 0}$, and hence the minimum of the set ${\{|q x - p|: p \in \mathbf Z, q \in \mathbf N\}}$ does not exist.

It may not be immediately obvious from the way in which it has been presented here why Dirichlet’s approximation theorem is called an “approximation theorem”. The reason is that if the inequality ${|q x - p| < 1/n}$ is divided through by ${q}$ (which produces an equivalent inequality, given that ${q}$ is positive), the result is

$\displaystyle \left| x - \frac p q \right| < \frac 1 {n q}. \ \ \ \ \ (2)$

So Dirichlet’s approximation theorem can also be interpreted as saying that for every real number ${x}$ and every positive integer ${n}$, it is possible to find a rational approximation ${p/q}$ to ${x}$ (where ${p}$ and ${q}$ are integers and ${q > 0}$) whose error is less than ${1/(nq)}$. In fact, this is how the theorem is usually presented. When it’s presented in this way, Dirichlet’s approximation theorem can be seen as an addendum to the fact that for every positive integer ${n}$, it is possible to find a rational approximation ${p/q}$ to ${x}$ whose error is less than ${1/n}$—that is, arbitrarily small rational approximations exist to every real number. (This is very easily proven—it’s really just another way of expressing the fact that the set of all rational numbers, ${\mathbf Q}$, is dense in the set of all real numbers, ${\mathbf R}$.) After obtaining that result, one might naturally think, “well, in this sense all real numbers are equally well approximable by rational numbers, but perhaps if I make the condition more strict by adding a factor of ${1/q}$ into the quantity the error has to be less than, I can uncover some interesting differences in the rational approximability of different real numbers.” But the relevance of Dirichlet’s approximation theorem can also be understood in a more direct way, and that’s what I wanted to show with this post.

Of course putting this extra factor in doesn’t lead to the discovery of any interesting differences in the rational approximability of of different real numbers. In order to get to the interesting differences, you have to add in yet another factor of ${1/q}$. A real number ${x}$ is said to be well approximable if and only if for every positive integer ${n}$, there are integers ${p}$ and ${q}$ such that ${q > 0}$ and

$\displaystyle |q x - p| < \frac 1 {n q}, \ \ \ \ \ (3)$

or, equivalently,

$\displaystyle \left| x - \frac p q \right| < \frac 1 {n q^2}. \ \ \ \ \ (4)$

Otherwise, ${x}$ is said to be badly approximable. Some real numbers are well approximable, and some are badly approximable.

There is in fact a very neat characterisation of the distinction in terms of continued fractions. The real numbers that are well approximable are precisely those that have arbitrarily large terms in their continued fraction expansion. For example, ${e}$ is well-approximable because its continued fraction expansion is

$\displaystyle [2, 1, 2, 1, 1, 4, 1, 1, 6, 1, 1, \dotsc]$

(Note that the pattern only appears from the third term onwards, so it’s really ${(e - 2)/(3 - e)}$ that has the interesting continued fraction expansion.) Every multiple of 2 appears in this continued fraction expansion, so there are arbitrarily large terms. The real numbers that are badly approximable, on the other hand, are those that have a maximum term in their continued fraction expansion. They include all quadratic irrational numbers (since those numbers have continued fraction expansions which are eventually periodic), as well as others. For example, the real number with the continued fraction expansion

$\displaystyle [1, 2, 1, 1, 2, 1, 1, 1, 2, \dotsc]$

is badly approximable. This distinction is the topic of the final year project I’m currently doing for my mathematics course at university.

I guess it would be possible to motivate the well approximable-badly approximable distinction in a similar way: note that a real number ${x}$ is rational if and only if there is an integer ${q}$ such that ${q^2 x}$ is an integer divisible by ${q}$, and then go on to say that the closeness of rationality of an irrational number ${x}$ can be judged by how close the terms of the sequence ${\langle x, 4 x, 9 x, \dotsc \rangle}$ are to integers that are multiples of ${q}$. The well approximable numbers would be those for which there exist terms of the sequence arbitrarily close to integers. Of course, this is a lot more contrived.

## A controversial maths question

The following question which appeared on an Edexcel GCSE maths paper (for readers outside of the UK, this is a test that would be taken at the end of high school, at the age of 15 or 16) has gone viral, with many students complaining about its difficulty.

There are n sweets in a bag. 6 of the sweets are orange. The rest of the sweets are yellow.

Hannah takes at random a sweet from the bag. She eats the sweet.

Hannah then takes at random another sweet from the bag. She eats the sweet.

The probability that Hannah eats two orange sweets is 1/3.

Show that n2n − 90 = 0.

I sympathise with the students complaining about this question’s difficulty. I don’t think it is an easy question for GCSE students. I found it easy to answer, but I’m studying for a Maths degree.

However, there are different kinds of difficulty. Sometimes a question is difficult because it involves the application of knowledge which is complex and/or less prominently featured in the syllabus, which makes acquiring this knowledge and keeping it in your head more difficult, and also makes it harder to realise when this knowledge needs to be applied. This was not one of those questions. The knowledge required to complete this question was very basic stuff which I expect most of the students complaining about it already knew. As far as I can tell, the following mathematical knowledge is required to answer this question:

• If m of some n objects have a certain property then the probability that one of these n objects, picked at random, will have that certain property is m/n.
• The combined probability of two events with probabilities x and y is xy (as long as the events are independent of each other, although you could get away with not understanding that part for this question).
• Basic algebraic manipulation techniques, so that you can see that the equation (6/n)(5/(n − 1)) = 1/3
• can be rearranged into n2n − 90 = 0. All this takes is knowing how to multiply the two fractions together on the left-hand side (new numerator is the product of the old numerators, new denominator is the product of the old denominators), then taking the reciprocals of both sides and bringing the 30 on the left-hand side over to the right-hand side (there are different ways you could describe this process).

Certainly for some of the students, the knowledge was what was at issue here, but I think the main challenge in this question was the successful application of this knowledge. Most of the students knew the three things listed above, but they failed to see that this knowledge could be used to answer the question.

Unfortunately for the students, failure to apply knowledge successfully is in many ways a much more serious failure than failure to possess the appropriate knowledge. If you fail due to lack of knowledge, there is an obvious step you can take to prevent further failure in the same situation: acquire the knowledge that you lack, by having somebody or something teach you it. Crucially, when you successfully learn something you not only end up knowing it, but you also know that you know it.

If, on the other hand, your problem is that you failed to apply your knowledge successfully, then it is much less clear what the next step you should take is. And, also, you never know whether you are capable of applying your knowledge in every situation where it might be useful, because there are usually a whole lot of different situations where it might be useful and it is impossible for you to be familiar with them all. This is why cramming for tests is not a good idea. Cramming may be the most efficient way to obtain the knowledge you need for a test (I don’t know whether this is actually true, but I don’t think it’s impossible) but it certainly won’t help you with applying your knowledge.

There is one relatively straightforward way to become better at applying your knowledge. If you remember how you applied a certain piece of knowledge in a previous situation, then, if the exact same situation occurs again, you won’t have any trouble, because you will just apply the knowledge in the same way again. In a similar way, the more similar the new situation is to the old one which you remember how to deal with, the more likely you are to be able to successfully apply your knowledge.

But, in a way, this is a trick. I am about to get a bit esoteric here, but bear with me. Let’s say your mind has two “departments” which work together to solve a problem. One of them is the Department of Knowledge-Recall, or the DKR for short, and the other is the Department of Knowledge-Application, or the DKA for short. I think that when you carry out the strategy above, what is happening in your mind is that the DKA is pre-emptively passing on the tough part of the work to the DKR for them to deal with. If you remember how you used a certain piece of knowledge (let’s call it X) to deal with a previous situation, then that memory, in itself, has become knowledge (let’s call it Y). The DKR has worked so that Y is easily recalled. When the situation comes up again, the DKA’s task is really easy. It’s looking at the most prominent pieces of knowledge that the DKR has made you recall, and it notices that Y has a kind of direct link to this situation. If the DKR hadn’t worked to make Y a piece of easily-recalled knowledge for you, the DKA would have to do more work, sifting through the knowledge that the DKR has made you recall, possibly asking the DKR for more, inspecting them more closely for any connection to the situation at hand.

There’s a good chance the above paragraph made no sense. But basically, I’m trying to say that familiarising yourself with the situations where you need to apply your knowledge works because it is a process of acquiring specialised knowledge about where you need to apply your knowledge—it is not truly applying your knowledge. But probably the more important point to make is that this process is inefficient. It would be much simpler if you could simply apply your knowledge to unfamiliar situations straight away. And it’s often impossible to familiarise yourself with every conceivable situation, because the range of conceivable situations is so vast.

Students taking tests try to carry out the familiarisation process by doing past papers. This is often effective because the range of questions that can be on a paper is often quite limited, so it really is possible to familiarise yourself with every situation where you might need to apply your knowledge. But this isn’t a good thing!

As I argued (somewhat incoherently) above, the skill of being able to apply your knowledge is only separate from the skill of being able to recall your knowledge if it includes the sub-skill of being able to apply your knowledge to unfamiliar situations. That particular sub-skill is one which is hard to improve. Arguably, this sub-skill is what we mean by “intelligence” (the word “intelligence” is used to refer to a lot of different things, but this might be the most prototypical thing referred to as “intelligence”). It’s certainly possible to get better at this sub-skill, but I don’t know if it is possible to get better through conscious effort.

But intelligence (by which I mean, the ability to apply your knowledge to unfamiliar situations) is often the main thing the tests that students take are trying to assess. After all, there are few vocations, if any, that a person cannot do better at by being more intelligent. You don’t always need intelligence to get to an acceptable level, sure, but intelligence always helps. The purpose of the GCSEs is not just to find out which students are more or less intelligent—they are also supposed to increase the amount of people who have useful knowledge—but it is one their main purposes.

That’s why I don’t think this question was unfair, as some students have been saying. Yes, it was quite different from anything that was on the past papers. But that was the whole point, to test people’s ability to apply their knowledge in unfamiliar situations. It is natural to be disappointed if you couldn’t answer the question and to complain about your disappointment, but saying that it was unfair is a step too far. It did what it was meant to do.

I think there is possibly a genuine problem here, though. If questions like this which strongly tested intelligence (as defined above) were usual on the GCSE papers, you wouldn’t expect this one to become such a topic of interest. Perhaps the GCSEs have suffered in the past from a lack of questions like this, which has affected students’ expectations of what these tests should be like. I should be clear that I have no idea whether this is true. I took my GCSEs in 2011, but I don’t remember what the questions were like in this respect.

PS: I’ve seen some people saying “this is easy, the answer is 10”. These people are making fools of themselves, because the answer is not 10, in fact that doesn’t even make any sense. The question is asking for a proof. (“Show that…”) It seems these people have just seen the quadratic equation at the end and assumed that the question was “Solve this equation” without actually reading it. Perhaps this is evidence that the question really isn’t easy. Or maybe these people just aren’t thinking about it very much.

## Some thoughts on how people come up with mathematical proofs

Everyone who has studied mathematics and tried to prove things themselves will have realised that the process is quite different from what you might imagine from the proofs given in textbooks. In textbooks, proofs are made up of a sequence of statements which are, roughly, ordered by implication: the proof begins with the hypotheses of the theorem, and the statements derive conclusions from these hypotheses and from the conclusions of previous statements until finally the statement that is to be proved is concluded. But when you are coming up with a proof yourself, you can’t just derive things from the hypotheses without bearing in mind what conclusion you’re aiming for. After all, there are infinitely many statements that could be derived from the hypothesis and only some of these will help lead you towards the inclusion. Instead, you often have to work backwards, trying to find hypotheses that imply the desired conclusion, so that you can aim to conclude the statements of these hypotheses from your original hypotheses, rather than aiming for the desired conclusion directly. Of course, you have to bear in mind what the original hypotheses are while doing this. So you end up constantly alternating between working forwards and working backwards until finally everything links up somewhere in the middle. Deduction can be thought of, in metaphorical terms, as a kind of journey. In the textbooks, you have a group of people, corresponding to the hypotheses, travelling towards their destination, which corresponds to the conclusion. But deduction as a mental process is more like a meeting between two parties.

I thought it might be interesting, therefore, to try and write down how I came up with a proof, to show you how the process works. The thing I’m going to prove is the following statement.

Theorem 1. There is a rational number between every pair of distinct real numbers.

The proof will be different, of course, depending on how you define the real numbers. My proof will use the axiomatic definition of the real numbers. In case you are not familiar with this, the axiomatic definition is the one where we don’t really worry about what the real numbers are; we just list some of the most essential properties of the real numbers, and see what we can prove if we assume that these properties hold. These assumed properties are called our axioms. Although this might seem philosophically dubious, if we can prove something from the axioms, then we can be quite sure that it is a true fact about the real numbers, regardless of how we define them, because we can always be sure that the axioms hold. After all, the axioms are the most essential properties of the real numbers. If we were to define the ‘real numbers’ in such a way that not all of the axioms held, it could be reasonably argued that these objects were not, in fact, real real numbers. (This is kind of like what I was talking about in the above paragraph. Rather than going directly from a more concrete definition of the real numbers to the Theorem 1, we are just going to prove that certain hypotheses—the axioms—imply Theorem 1, and we can be satisfied with that since we can assume that given any concrete definition, we will be able to derive the axioms.)

The axioms we will use mostly consist of elementary algebraic facts that you learn in school, such as the fact that adding 0 to a number doesn’t change it, or the fact that the product of two positive numbers is positive. I assume the reader is familiar with these facts. The only axiom which will be new to you, unless you’ve taken a course in real analysis, is the axiom of completeness. The axiom of completeness is a bit complicated, so I won’t tell you what it is here; instead, I’ll just assume a weaker form of the axiom called the axiom of Archimedes. The axiom of Archimedes is simply the statement that no real number is greater than every integer. The axiom of completeness implies the axiom of Archimedes, but the axiom of Archimedes does not imply the axiom of completeness. However, the axiom of Archimedes is all we need to prove Theorem 1; we won’t need to use the full axiom of completeness.

Now we know, more or less, exactly what we’re trying to prove. It’s always very important to make sure you know this before trying to make a proof, otherwise the attempt will be hopeless. Let’s finally make a start on the proof itself. The statement to be proven is that there is a rational number between every pair of distinct real numbers. The natural way to carry out the proof of a statement like this is to show how, given two distinct real numbers $a$ and $b$, we can find a rational number $r$ between $a$ and $b$. For example, we might be able to find a formula for $r$ in terms of $a$ and $b$. It’s not immediately obvious how we might find such a formula, though. It’s easy to give a formula for a real number between $a$ and $b$; for example, $(a + b) / 2$ is a real number between $a$ and $b$. The problem is in making sure that the number is rational. There aren’t any ways of combining arbitrary real numbers using simple arithmetic operations that will ensure that the result is rational.

Let’s think about what it means for a number to be rational. It means that it can be expressed as a fraction with an integral numerator and an integral denominator. In other words, $r$ is rational if and only if there are two integers $m$ and $n$ such that $r = m / n$. In fact, because we can cancel out the common factors of $m$ and $n$ if they have any, and we can transfer a minus sign in $n$ up to $m$, we can express every rational number as a fraction whose numerator and denominator have no common factors and whose denominator is positive. So what we’re looking for is a number of the form $m / n$, where $m$ and $n$ are integers with no common factors and $n$ is positive, which is between $a$ and $b$.

As $a$ and $b$ are distinct, one of them must be greater than the other. Let’s assume that $b$ is the greater one. This assumption does not limit us in any way; if $b$ is actually the smaller one, then we can just swap the names of $a$ and $b$. Then, if $m / n$ is to be between $a$ and $b$, that means that $a < m / n < b$. Now, it’s always convenient to get rid of divisions if we can help it. And since $n$ is positive, we can multiply the parts of any inequality by $n$, and the inequality will still hold; in fact, it will also imply the old one. So, the inequality $a < m / n < b$ is equivalent to the inequality $n a < m < n b$. Hence, in order to prove Theorem 1, we can prove the following equivalent theorem (which we’ll call a lemma, actually, since that’s the usual term for theorems you are only interested in so that you can prove other theorems with them).

Lemma 1. For every pair of distinct real numbers $a$ and $b$, there is an integer between $n a$ and $n b$ for some positive integer $n$.

The difference between Theorem 1 and Lemma 1 is minor; the two are completely logically equivalent statements; and yet Lemma 1 a lot easier to think about, visually. If you had never been told that there is a rational number between every pair of distinct real numbers before, you’d probably be somewhat unsure if it was true. But if you think about what Lemma 1 means, it isn’t too difficult to see that it’s true. Think about the distance between $n a$ and $n b$ on the number line, given an arbitrary positive integer $n$. This is $n b - n a$ (using the fact that $b$ is greater than $a$). But this can also be written as $n (b - a)$. So it’s the distance between $a$ and $b$, enlarged by a factor of $n$. Now, all the statement requires is that there is an integer somewhere in the $n (b - a)$-width interval between $n a$ and $n b$ for some positive integer $n$. In particular, it doesn’t matter how big $n$ needs to be. So the question becomes: is it possible to enlarge the gap between $a$ and $b$ so much that there is at least one integer within the enlarged gap, if we restrict ourselves to integral scale factors, but allow the scale factor to be as large as we like? Well, if the distance between $n a$ and $n b$ (where $n$ is the scale factor) is greater than 1, there will have to be an integer in the interval between $n a$ and $n b$. And the distance between $n a$ and $n b$ is proportional to $n$. So, if we can make $n$ as large as we like, we can make the distance between $n a$ and $n b$ as large as we like; in particular we can make it greater than to 1. And then we’re done! (Note that I’ve tacitly relied on the axiom of Archimedes here—can you see why? If not, I’ll explain how shortly.)

At this point, I’m pretty much convinced that Theorem 1 can be proven, but what I said above wasn’t really up to the usual standards for a proof, it was more of an intuitive justification. We need to state precisely how Theorem 1 follows from the axioms. The chief weak point in the reasoning above is the statement that as long as the distance between $n a$ and $n b$ is greater than to 1, there is an integer in the interval between $n a$ and $n b$. It probably seems pretty obvious that this is true; in general, for every pair of real numbers $x$ and $y$ such that $x < y$ and $y - x > 1$, there is an integer $k$ such that $x < k < y$, or so you would think based on intuition. But I can’t immediately see how it follows from the axioms. So let’s try and prove it. The first thing that occurs to me is that we can easily find an integer $k$ such that $x < k$. Because the existence of such an integer is more or less exactly what the axiom of Archimedes asserts. If there was no such integer, then $x$ would be greater or equal to every integer, and hence $x + 1$ would be greater than every integer, but the axiom of Archimedes states that no real number is greater than every integer. So the question is, of these integers $k$ such that $x < k$, how can we find one which is less than $y$? Well, we’re pretty sure that what we’re trying to prove is true, so there must be at least one integer $k'$ which is greater than $x$ but less than $y$. Therefore, the least integer $k$ such that $x < k$ must also be less than $y$, because it is less than $k'$. So, if we let $k'$ be this least integer, we should be able to prove that $k' < y$, somehow. But how? A useful thing to do in this kind of situation is to remember what assumptions we have made. If the statement you’re trying to prove can be false if one your assumptions does not hold, you’re going to have to make use of that assumption in the proof; otherwise, your proof would work just as well at proving the statement without making that assumption, yet the statement is not true without that assumption. So we’re necessarily going to have to use the fact that $k'$ is the least integer $k$ such that $x < k$, because there are integers $k$ such that $x < k$ but $y \le k$ (since the axiom of Archimedes ensures that there are integers greater than $y$, and $y$ is greater than $x$). And we’re going to have to use the fact that $y - x > 1$ at some point, of course. Any ideas? Here’s something: we can rearrange the inequality $y - x > 1$ to get $y > x + 1$. Now, let’s suppose that, counter to what we suspect is true, $y \le k$. Then, since $x + 1$ is less than $y$, we have $x < k$ as well. Rearranging this, we have $x < k - 1$. But if $x < k - 1$, then $k$ is not the least integer $k$ such that $x < k$. So our supposition that $y \le k$ leads to an inevitable contradiction. We can therefore conclude that, in fact, $k < y$, which is what we needed to prove.

So, now let’s return to Lemma 1 and prove it with greater rigour. Remember, we need to find a positive integer $n$ such that there is an integer between $n a$ and $n b$. The first thing we need to do is to decide what value of $n$ to use. We want the distance between $n a$ and $n b$, $n (b - a)$, to be greater than 1, i.e. $n (b - a) > 1$. Dividing both sides of this inequality by $b - a$ (which we can do because $a < b$, so $b - a$ is positive) we get $n > 1 / (b - a)$. Can we find a positive integer $n$ greater than $1 / (b - a)$? Yes, of course we can, by the axiom of Archimedes. The axiom of Archimedes ensures that there is an integer greater than $1 / (b - a)$; it doesn’t say anything about whether it’s positive; but since $1 / (b - a)$ is positive, it has to be positive. So we have our value of $n$. Of course we don’t know exactly what it is, but we know that a value which will work exists. Now, since the distance between $n a$ and $n b$ is greater than 1, we can immediately conclude that there is an integer between $n a$ and $n b$ by what we proved above. This completes the proof of Lemma 1, and thereby Theorem 1.

Just to be totally sure, let’s write the proof out, textbook style.

Theorem. For every pair of distinct real numbers $a$ and $b$, there is a rational number $r$ such that $a < r < b$.

Proof. Suppose $a$ and $b$ are real numbers and $a < b$. Since $b - a$ is positive, $1 / (b - a)$ exists, and in fact is positive, and by the axiom of Archimedes there is an integer $n$ such that $1 / (b - a) < n$. Since $1 / (b - a)$ is positive, $n$ has to be positive. Also, since $b - a$ is positive, we can multiply both sides of the inequality by $b - a$ to get $1 < n (b - a) = n b - n a$. It follows that there is an integer $m$ such that $n a < m < n b$. Since $n$ is positive, we can divide each part of this inequality by $n$ to get $a < m / n < b$. As $m$ and $n$ are both integers, $m / n$ is rational.

So, in the end, the proof is just 7 sentences. If you followed this through to the end, you can probably see what I mean about the textbook proofs making things look a lot simpler. And even the description I’ve given in this post has omitted some of the messier details of the process. For example, as I wrote it I made occasional minor errors, mainly to do with strict and non-strict inequalities (for example, I originally only proved that there was an integer between any two real numbers separated by a distance greater than or equal to to 1, which is true, but not sufficient to imply Theorem 1 as the integer might be equal to one of the two real numbers). I opted to correct those rather than leaving them in and making things more confusing. So, hopefully, this has given you some insight into how mathematicians come up with proofs, if you didn’t know much about it already. Bear in mind, though, that this is a pretty straightforward undergraduate-level theorem. Professional mathematicians have to deal with proofs that are much, much, harder and more complicated.

## Large and small sets

For some infinite sets ${S = \{x_1, x_2, x_3, \dotsc\}}$ of positive integers, the sum of the reciprocals of the members of ${S}$, i.e.

$\displaystyle \frac 1 {x_1} + \frac 1 {x_2} + \frac 1 {x_3} + \dotsb,$

diverges. Such sets are said to be large. For example, ${\mathbb N}$, the set of all the positive integers, is large, because the harmonic series diverges. On the other hand, for some infinite sets ${S}$ the sum of the reciprocals of the members of ${S}$ converges instead; such sets are said to be small. For example, for every integer ${p}$ greater than 1, the set of all the positive integral ${p}$th powers is small, because the series

$\displaystyle 1 + \frac 1 {2^p} + \frac 1 {3^p} + \dotsb$

converges. There is thus an interesting distinction between two different sizes of infinite sets of positive integers. Let’s have a look at some other infinite sets of positive integers and see whether they are large or small.

For every positive integer ${m}$, the set of all the positive integral multiples of ${m}$ is easily seen to be large, due to the fact that the harmonic series diverges, and dividing the harmonic series by ${m}$ yields the series

$\displaystyle \frac 1 m + \frac 1 {2 m} + \frac 1 {3 m} + \dotsb.$

In fact, this proof can easily be generalised to show that for every set ${S}$ of positive integers and every positive integer ${m}$, ${\{m n : n \in S\}}$ is large if and only if ${S}$ is large.

What about the set ${S}$ of all the positive integers which are not multiples of ${m}$? The sum of the reciprocals of the members of ${S}$ can be written as the difference

$\displaystyle \left( 1 + \frac 1 2 + \frac 1 3 + \dotsb \right) - \left( \frac 1 m + \frac 1 {2 m} + \frac 1 {3 m} + \dotsb \right).$

The minuend and subtrahend of this difference are both equal to ${\infty}$, so the sum of the reciprocals of the members of ${S}$ cannot be immediately seen from writing it like this. However, if we take out the factor of ${1 / m}$ from the subtrahend, we get

$\displaystyle \left( 1 + \frac 1 2 + \frac 1 3 + \dotsb \right) - \frac 1 m \left( 1 + \frac 1 2 + \frac 1 3 + \dotsb \right),$

and then we can remove the common factor of ${1 + 1 / 2 + 1 / 3 + \dotsb}$ to get

$\displaystyle \left( 1 + \frac 1 2 + \frac 1 3 + \dotsb \right) \left( 1 - \frac 1 m \right).$

It is easy to see that the value of this expression is also ${\infty}$. So ${S}$ is large, as well.

In fact, for every finite set of pairwise coprime positive integers ${p_1}$, ${p_2}$, … and ${p_{\ell}}$, the set ${S}$ of all the positive integers which are not multiples of any of these integers is large, too. This is because, if we let ${H}$ be the value of the harmonic series, then, using the inclusion-exclusion principle, the sum of the reciprocals of the members of ${S}$ can be written as

$\displaystyle H - \sum_{k} \frac H k + \sum_{k_1, k_2} \frac H {k_1 k_2} - \dotsb + \frac {(-1)^{\ell} H} {k_1 k_2 \dotsm k_{\ell}},$

in which, for every integer $j$ such that $1 \le j \le \ell$, the $j$th term is a sum over all the finite sets of $j$ integers $k$ such that $1 \le k \le \ell$. By factorising this expression, we can rewrite it as

$\displaystyle H \left( 1 - \frac 1 {p_1} \right) \left( 1 - \frac 1 {p_2} \right) \dotsm \left( 1 - \frac 1 {p_{\ell}} \right),$

and it is easy to see that the value of this expression is ${\infty}$.

Generalising from this, one might expect that if we let ${P = \{p_1, p_2, p_3, \dotsc\}}$ be the prime numbers, so that the members of ${P}$ are pairwise coprime and every integer is a multiple of a member of ${P}$, then

$\displaystyle H \left( 1 - \frac 1 {p_1} \right) \left( 1 - \frac 1 {p_2} \right) \left( 1 - \frac 1 {p_3} \right) \dotsm = 1. \ \ \ \ \ (1)$

To prove this rigorously, all that is needed is to note that for every positive integer ${n}$,

$\displaystyle H \left( 1 - \frac 1 {p_1} \right) \left( 1 - \frac 1 {p_2} \right) \dotsm \left( 1 - \frac 1 {p_n} \right) = 1 + K,$

where ${K}$ is the sum of the reciprocals of the integers greater than 1 which are not multiples of any of the first ${n}$ prime numbers. The least such integer is ${p_{n + 1}}$, so ${K}$ is smaller than the sum of the reciprocals of the integers greater than or equal to ${p_{n + 1}}$. Therefore,

$\displaystyle H \left( 1 - \frac 1 {p_1} \right) \left( 1 - \frac 1 {p_2} \right) \dotsm \left( 1 - \frac 1 {p_n} \right) < 1 + \frac 1 {p_{n + 1}} + \frac 1 {p_{n + 1} + 1} + \frac 1 {p_{n + 1} + 2} + \dotsb.$

As ${n}$ tends towards positive infinity, the sum of the reciprocals of the integers greater than or equal to ${p_{n + 1}}$ tends towards 0, because it can be written as ${H - H_{p_{n + 1} - 1}}$, which is less than ${H - H_n}$, and by definition ${H_n}$ tends towards ${H}$ as ${n}$ tends towards positive infinity. So the right-hand side of the inequality above tends towards 1, and an application of the squeeze theorem completes the proof.

Rearranging (1) to solve for ${H}$ yields

$\displaystyle H = \left( \frac 1 {1 - 1 / p_1} \right) \left( \frac 1 {1 - 1 / p_2} \right) \left( \frac 1 {1 - 1 / p_3} \right) \dotsm,$

so we now have a way of expressing the harmonic series as an infinite product. This is quite interesting in its own right, but it can also be used to prove a rather surprising result. By taking the natural logarithm of both sides, we see that

$\displaystyle \ln H = \ln \left( \frac 1 {1 - 1 / p_1} \right) + \ln \left( \frac 1 {1 - 1 / p_2} \right) + \ln \left( \frac 1 {1 - 1 / p_3} \right) + \dotsb.$

The natural logarithm of ${\infty}$ is ${\infty}$, so this means the series on the right-hand side diverges. For every positive integer ${n}$, the ${n}$th term of the series on the right-hand side, ${\ln (1 / (1 - 1 / p_n))}$, can also be written as ${\ln (1 + 1 / (p_n - 1))}$, and this number is less than or equal to ${1 / (p_n - 1)}$ (in general, for every real number ${x}$ greater than ${-1}$, ${\ln (1 + x) \le x}$). If ${n}$ is greater than 1, then this number, in turn, is less than or equal to ${1 / p_{n - 1}}$, because ${p_n - 1}$ is greater than or equal to ${p_{n - 1}}$. And ${1 / (2 - 1)}$, i.e. 1, is less than or equal to 1. So, by the comparison test, the series

$\displaystyle 1 + \frac 1 {p_1} + \frac 1 {p_2} + \frac 1 {p_3} + \dotsb$

diverges, too. This shows that the set consisting of 1 as well as all the prime numbers is large, but it is then an immediate consequence that the set of all the prime numbers, not including 1, is large (since we can subtract 1 from the above series and the value of the result will still have to be ${\infty}$).

This is a rather interesting and surprising result. The set of all the prime numbers is large, but for every positive integer ${p}$, the set of all the positive integral ${p}$th powers is small, which means that, in a certain sense, there are more prime numbers than positive integral ${p}$th powers.

## How to calculate cube roots in your head

Here’s a way to impress people. Ask somebody to think of a number with no more than 3 digits and tell you its cube (they can use a calculator to do this). Using the method I’m about to tell you about, you can calculate the original number, the cube root of the number you’re given, without needing to carry out any intermediate calculations other than addition and subtraction of numbers with no more than 2 digits. If you can carry out these intermediate calculations at a normal speed, you will probably be able to calculate the cube root within a minute, with some practice.

Suppose the number you’re given is ${n^3}$ (so that ${n}$ is the original number, the cube root you’re trying to find). You will know that ${n}$ has at most 3 digits. If you use this method, you will calculate the three digits of ${n}$ one by one. The initial and final digits can be calculated almost immediately; the middle digit is the one you will spend most of your time calculating, because this is the part of the calculation where you have to carry out some additions and subtractions.

1. The final digit

In order to find the final digit, you can use the fact that for every pair of numbers ${x}$ and ${y}$, ${x}$ and ${y}$ have the same final digit if and only if ${x^3}$ and ${y^3}$ have the same final digit. In other words, the final digit of ${n}$ is predictable from the final digit of ${n^3}$. The following table can be used.

 Final digit of ${n^3}$ 0 1 2 3 4 5 6 7 8 9 Final digit of ${n}$ 0 1 8 7 4 5 6 3 2 9

The table is not difficult to memorise. Just remember it this way: if the final digit of ${n^3}$ is 0, 1, 4, 5, 6, or 9, then the final digit of ${n}$ is the same, and otherwise (if the final digit of ${n^3}$ is 2, 3, 7 or 8), the final digit of ${n}$ is 10 minus the final digit of ${n^3}$.

If you’re not interested in why this works, you can skip ahead a few paragraphs. But in case you are, I will explain why this works. First, note that for every number ${x}$, the final digit of ${x}$ is its remainder on division by 10, i.e. the unique integer ${d}$ such that ${0 \le d < 10}$ and ${x = 10 q + d}$ for some integer ${q}$. For example, ${143 = 140 + 3}$ and ${2764 = 2760 + 4}$. Cubing both sides of the latter equation yields

$\displaystyle \begin{array}{rcl} x^3 &=& (10 q + d)^3 \\ &=& 10^3 q^3 + 3 \cdot 10^2 q^2 d + 3 \cdot 10 q d^2 + d^3 \\ &=& 10 (10^2 q^3 + 3 \cdot 10 q^2 d + 3 \cdot q d^2) + d^3. \end{array}$

Now, if we let ${r}$ be the final digit of ${d^3}$, i.e. the remainder of ${d^3}$ on division by 10, i.e. the unique integer ${r}$ such that ${0 \le r < 10}$ and ${d^3 = 10 p + r}$ for some integer ${q}$, we then have

$\displaystyle \begin{array}{rcl} x^3 &=& 10 (10^2 q^3 + 3 \cdot 10 q^2 d + 3 \cdot q d^2) + 10 p + r \\ &=& 10 (10^2 q^3 + 3 \cdot 10 q^2 d + 3 \cdot q d^2 + p) + r, \end{array}$

from which we can see that ${r}$ is also the remainder of ${x^3}$ on division by 10, i.e. the final digit of ${x^3}$. This shows that for every number ${x}$, the final digit of ${x^3}$ is the final digit of ${d^3}$, where ${d}$ is the final digit of ${x}$. Let’s make a table showing what the final digit of ${d^3}$ is for each of the 10 possible values of ${d}$. The final digits of each of the numbers in the second row have been bolded.

 ${d}$ 0 1 2 3 4 5 6 7 8 9 ${d^3}$ 0 1 8 27 64 125 216 343 512 729

Now, returning to the task at hand, we know the final digit of ${n^3}$, which is also the final digit of ${d^3}$, where ${d}$ is the final digit of ${n}$. Let’s call this number ${d'}$. From the table above, we can see that if ${d}$ is 1, ${d'}$ is 1; if ${d}$ is 3, ${d'}$ is 7; if ${d}$ is 6, ${d'}$ is 6; if ${d}$ is 8, ${d'}$ is 2, and so on. In fact, for every possible value of ${d'}$ there is only one possible value of ${d}$, and therefore we can read off the value of ${d}$ by finding the unique cell in the bottom row containing a number whose final digit is ${d'}$, and seeing which number is in the cell above. This is where the original table comes from.

This proof crucially relies on the fact that for every possible value of ${d'}$ there is only one possible value of ${d}$. If you try to do the same thing with, say, square roots, you find that there are some possible values of ${d'}$ for which there are multiple possible values of ${d}$, and therefore, while this method can narrow down the possible values of ${d}$ it cannot give you the exact value.

2. The initial digit

In order to find the initial digit, have a look at the table above which shows the cubes of each of the non-negative integers less than 10. You will need to memorise this table, unfortunately, and there is no simple summary for it as there is with the other table. However, knowing the first 10 non-negative perfect cubes is useful in a lot of situations where you need to make mental calculations, not just in this situation.

First of all, you need to write ${n^3}$ as a 9-digit number. That means, if it has less than 9 digits, put 0s in front of it until it has 9 digits. Now, look at the first 3 digits of ${n^3}$, when it is written like this; this gives you a 3-digit number. Using your knowledge of the first 10 non-negative perfect cubes, identify the greatest perfect cube which is no greater than this 3-digit number. The first digit of ${n}$ is the cube root of this perfect cube. You can also use the following table.

 First 3 digits of ${n^3}$ 000 001–007 008–026 027–063 064–124 First digit of ${n}$ 0 1 2 3 4

 First 3 digits of ${n^3}$ 125–215 216–342 343–511 512–728 729–999 First digit of ${n}$ 5 6 7 8 9

To see why this works, note that for every number ${x}$ with no more than 3 digits, the first digit of ${x}$ is the unique integer ${d}$ such that ${10^2 d \le x < 10^2 (d + 1)}$, where ${n}$ is the number of digits of ${x}$. For example, ${300 \le 341 < 400}$ and ${100 \le 192 < 200}$. Cubing each part of this inequality yields ${10^6 d^3 \le x^3 \le 10^6 (d + 1)^3}$. For example, ${27000000 \le 341^3 < 64000000}$ and ${1000000 \le 192^3 < 8000000}$. And if an inequality of the form of the latter one is observed to hold, for some value of ${d}$, then the cube roots of each part can be taken to obtain an inequality of the form of the former one, thus showing that the first digit of ${x}$ is ${d}$.

By the way, this part of the method does not rely on the fact that ${n^3}$ is a perfect cube; it will also work for finding the first digit of irrational cube roots. It can also be generalised without complication to find the first digit of roots of any order: square roots, 4th roots, etc.

3. The middle digit

The cleverest part of the method is the calculation of the middle digit. Unlike the initial and final digits, there is no simple characterisation of how the middle digit relates to the whole number. Instead, we take advantage of a rather obscure fact that relates the three digits of ${n}$. The middle digit can therefore only be calculated once the other two digits are known. This obscure fact is the fact that for every number ${x}$, the remainder of ${x}$ on division by 11 is same as the remainder on division by 11 of the alternating sum of the digits of ${x}$. By the alternating sum of the digits of ${x}$, I mean the sum of the digits of ${x}$ at places corresponding to even powers of 10 (the unit place, the hundreds place, the ten thousands place and so on) minus the sum of the digits of ${x}$ at places corresponding to odd powers of 10 (the tens place, the thousands place, the hundred thousands place and so on). For example, the alternating sum of the digits of 121281 is ${1 - 8 + 2 - 1 + 2 - 1 = -5}$, and, sure enough, we have ${121281 = 11 \cdot 11025 + 6}$, while ${-5 = -11 + 6}$, so the remainders of 121281 and -5 on division by 11 are the same. In practice, to calculate the alternating sum of ${x}$, you look at each digit of ${x}$ in turn, going from right to left (not the order the digits are written in, which is left to right!) and alternate between adding and subtracting the digit from the total; that’s why it’s called an alternating sum. It can be helpful to take the digits in groups of two and find the difference between the two digits in the group first, then add it to the total; that way, you are less likely to get confused since you do the same thing at each step. For example, to calculate the alternating sum of 121281, you see that ${1 - 8 = -7}$, and set the total at ${-7}$, then you see that ${2 - 1 = 1}$, and set the total at ${-6}$, and finally you see that ${2 - 1 = 1}$ and set the total at ${-5}$. By the way, if the alternating sum is non-negative and less than 11 then it is equal to its remainder on division by 11; otherwise, although it will often be small enough that you can immediately see its remainder on division by 11, you could find the alternating sum of the digits of the alternating sum, repeating the process, to get an even smaller number.

If you just want to see how to calculate the middle digit, skip ahead, but you might be wondering why these two remainders are the same, so I will give you a proof. Suppose ${x}$ is an integer with ${n}$ digits. Let ${d_0}$ be its unit digit, let ${d_1}$ be its tens digit, … and let ${d_n}$ be its ${10^n}$s digit, so that

$\displaystyle x = d_0 + 10 d_1 + 10^2 d_2 + \dotsb + 10^n d_n.$

We can also write this as

$\displaystyle x = d_0 + (11 - 1) d_1 + (11 - 1)^2 d_2 + \dotsb + (11 - 1)^n d_n.$

For every integer ${k}$ such that ${0 \le k \le n}$, ${(11 - 1)^k}$ expands into a sum that can be written as ${11 q_k + (-1)^k}$ for some integer ${q_k}$ (because every term in the sum other than ${(-1)^k}$ has at least one factor of 11). Therefore, we have

$\displaystyle \begin{array}{rcl} x &=& d_0 + (11 - 1) d_1 + (11 \cdot 9 + 1) d_2 + \dotsb + (11 q_k + (-1)^k) d_n \\ &=& 11 (d_1 + 9 d_2 + \dotsb + q_k d_n) + d_0 - d_1 + d_2 + \dotsb + (-1)^k d_n, \end{array}$

from which we can see that the remainder of ${d_0 - d_1 + d_2 + \dotsb + (-1)^k d_n}$ on division by 11 will be the same as the remainder of ${x}$ on division by 11 (by the same logic we used above to show that ${x^3}$ and ${d^3}$ had the same remainders on division by 10). This completes the proof.

So, how do you use this fact to calculate the middle digit? Well, first note that for every number ${x}$, if we let ${r}$ be the remainder of ${x}$ on division by 11, then ${r^3}$ is the remainder of ${x}$ on division by 11. We proved this above for division by 10 rather than division by 11, and the proof for division by 11 is pretty much exactly the same. Just replace references to 10 by references to 11. So I won’t go through the proof in detail here. Instead, we can get right awya to seeing what the remainder of ${r^3}$ on division by 11 is for each of the 11 possible values of ${r}$. In the following table, the terms ‘quotient of ${r^3}$’ and ‘remainder of ${r^3}$’ should be taken to refer to the quotient and remainder on division by 11; there just isn’t room to give a full description inside the table.

 ${r}$ 0 1 2 3 4 5 6 7 8 9 10 ${r^3}$ 0 1 8 27 64 125 216 343 512 729 1000 Quotient of ${r^3}$ 0 0 0 2 5 11 19 31 46 66 90 Remainder of ${r^3}$ 0 1 8 5 9 4 7 2 6 3 10

Again, it happens that this table has the convenient property that for every possible value of ${r^3}$ there is only one possible value of ${r}$, and therefore we can use it to find the value of ${r}$ given the value of ${r^3}$. Or, more conveniently, we can use the following inverted form of the table.

 ${r^3}$ 0 1 2 3 4 5 6 7 9 10 ${r}$ 0 1 7 9 5 3 8 6 2 4 10

Unfortunately, there is no simple way to summarise the contents of this table as there was with the earlier one. It’s just an more or less random correspondance which you have to memorise. This is why you need a bit of practice to be able to do the calculation quickly. But if you put your mind to it and test yourself at regular intervals over a couple of days, you’ll probably be able to successfully memorise the correspondance.

So, in order to find the middle digit, you find the remainder of ${r^3}$ on division by 11 by calculating the alternating sum of the digits of ${n^3}$ (and taking the remainder on division by 11 of this alternating sum, although often it will be non-negative and less than 11, in which case it is the same as its remainder). Then you use the table above to find the remainder of ${r}$ upon division by 11. Bearing in mind the fact we stated above about remainders upon division by 11, perhaps you can now see what to do. The alternating sum of the digits of ${n}$ is ${D - m + d}$, where ${D}$ is the initial digit of ${n}$, ${m}$ is the middle digit of ${n}$ and ${d}$ is the final digit of ${n}$. You can work out the values of ${D}$ and ${d}$, and you also know that the remainder on division by 11 of this alternating sum is ${r}$. At this point you might be able to immediately see what ${m}$ has to be, but its value can be calculated systematically: we have ${D - m + d = 11 q + r}$ for some integer ${q}$, so ${D + d - r = 11 q + m}$, which shows that ${m}$ is the remainder on division by 11 of ${D + d - r}$.

This is a little complicated, so an example will be helpful. Suppose ${n^3 = 257259456}$. The final digit of ${n}$ must also be 6 (using the table), and the initial digit of ${n}$ must be 6 since ${6^2 = 216 \le 257 < 343}$. The next step is to calculate the alternating sum of the digits of ${n^3}$: ${(6 - 5) + (4 - 9) + (5 - 2) + (7 - 5) + 2 = 1 + (-5) + 3 + 2 + 2 = 3}$. From the table it follows that the remainder of ${n}$ upon division by 11 is 9. Therefore, the remainder on division by 11 of ${6 - m + 6}$, i.e. ${12 - m}$, where ${m}$ is the middle digit of ${n}$, must be 9. It is then easy to see that ${m}$ must be 3, so ${n = 636}$. The last step could also be done by noting that ${6 + 6 - 9 = 3}$. Try using a calculator to confirm for yourself that ${636^3 = 257259456}$.

4. Summary

I’ve explained everything at great length here, so it might seem like this method is very complicated, but it isn’t really. Here’s a step-by-step presentation of the method, with rough estimates of how long it will take for you to do each step.

1. Look at the final digit of ${n^3}$ and use your knowledge of the correspondance to find the final digit ${d}$ of ${n}$. (If ${n^3}$ ends in 2, 3, 7 or 8, ${d}$ is 10 minus the final digit of ${n^3}$, and otherwise ${n}$ ends in the same digit as ${n^3}$.)
2. Find the first 3 digits of ${n^3}$ as a 9-digit number. Use your knowledge of the first ten non-negative perfect cubes and find the greatest integer ${D}$ such that ${D^3}$ is less than the resulting 3-digit number. The first digit is ${D}$.
3. Find the alternating sum of the digits of ${n^3}$, and if needed repeat the process to get the remainder of ${n^3}$ on division by 11.
4. Use your knowledge of the correspondance to find the remainder ${r}$ on division by 11 of ${n}$.
5. Calculate ${D + d - r}$ and find its remainder ${m}$ on division by 11. The middle digit is ${m}$.

If you have everything memorised, I think step 1 will take at most 2 seconds, step 2 will take perhaps a little longer, but still at most 5 seconds, step 3 will take the majority of your time, at most 30 seconds, with the potential to last much longer if you get confused, step 4 will take at most 2 seconds, and step 5 will take at most 10 seconds. That’s why I can confidently say that you will be able to calculate the cube root within a minute.

I learnt about this from this post at Dead Reckonings. If you are interested in this, have a look at that post— it covers a lot of other mental calculation methods in a more concise manner than this post. I don’t know about you, but until I came across that post I always used to associate the topic of mental calculation with tedious execution of algorithms and had therefore considered it uninteresting and not really `real maths’. But as you acquire more mathematical knowledge, you find that calculations end up becoming a sort of creative process, where you often don’t stick to a single method but improvise a way of tackling the problem, drawing on what you know, so that it can be solved as quickly as possible. It’s also interesting to see why these algorithms work and think of how one might come up with them. If you’re interested you could try coming up with similar algorithms for other kinds of roots. Or, perhaps, see if there’s a way of finding cube roots of numbers with more than 9 digits (though I have no idea how you might do that).