Pure and mixed strategies in prediction

Consider the following very simple game: a Bernoulli trial (a trial which results in one of two possible outcomes, labelled “success” and “failure”) is carried out with success probability p. Beforehand, you are told the value of p and asked to give a definite prediction of the trial’s outcome. That is, you have to predict either success or failure; just saying “the probability of success is p” is not enough. You win if and only if you predict the correct outcome.

Here are two reasonable-sounding strategies for this game:

  1. If p > 0.5, predict success. If p < 0.5, predict failure. If p = 0.5, predict success with probability 0.5 and failure with probability 0.5.
  2. Predict success with probability p and failure with probability 1 - p.

In game-theoretic language, the difference between strategies 1 and 2 is that strategy 1 involves the use of a pure strategy if possible, i.e. one in which the choice of what to predict is made deterministically, while strategy 2 is always mixed, i.e. the choice of what to predict is made randomly.

But which is better? Note that the answer may depend on the value of p. Try to think about it for a minute before moving on to the next paragraph.

If p = 0.5, then the strategies are identical and therefore both equally good.

If p \ne 0.5, let q be the probability of the more probable outcome (i.e. p if p > 0.5 and 1 - p if p < 0.5). If the more probable outcome happens, then you win for sure under strategy 1 but you only have probability q of winning under strategy 2. If the less probable outcome happens, then you lose for sure under strategy 1 but you still have probability 1 - q of winning under strategy 2. Therefore the probability of winning is q \cdot 1 + (1 - q) \cdot 0 = q under strategy 1 and q \cdot q + (1 - q) \cdot (1 - q) = 1 - 2q(1 - q) under strategy 2. So strategy 1 is better than strategy 2 if and only if

\displaystyle q > 1 - 2q(1 - q),


\displaystyle 3q - 2q^2 - 1 > 0.

This quadratic inequality holds if and only if 0.5 < q < 1. But q is the probability of the more probable outcome, and therefore q > 0.5 for sure. Therefore, strategy 1 is always better if p \ne 0.5.

I find this result weird and a little counterintuitive when it’s stated so abstractly. It seems to me like the most natural way of obtaining a definite value from the distribution—drawing randomly from the distribution—should be the best one.

But I guess it does makes sense, if you think about it as applying to a concrete situation. For example, if you were on a jury and you thought there was a 1/1024 probability that the defendant was guilty, it would be crazy to then flip 10 coins and precommit to arguing for the defendant’s guilt if every one of them came up heads. The other jurors would think you were mad (and probably be very angry with you, if they did all come up heads).

The result has interesting implications for how people should act on their beliefs. If you believe that degrees of belief can be usefully modelled as probabilities, and you try to apply this in everyday reasoning, you will often be faced with the problem of deciding whether to act in accordance with a belief’s truth even if you only place a certain probability p on that belief being true. Should you always act in accordance with the belief if p > 0.5, or should you have probability p of acting in accordance with it at any given time? Until I wrote this post it wasn’t obvious to me, but the result in this post suggests you should do the former.

I do wonder if there is anything strategy 2 is good for, though. Comment if you have an idea!

Generating random probability distributions

I’ve been thinking about how to use a computer program to randomly generate a probability distribution of a given finite size. This has turned out to be an interesting problem.

My first idea was to generate n − 1 uniform variates in [0, 1] (where n is the desired size), sort them, add 0 to the front of the list and 1 to the back of the list, and then take the non-negative differences between the adjacent variates. In Python code:

def randpd1(size):
    variates = [random.random() for i in range(size - 1)]
    return [j - i for i, j in zip([0] + variates, variates + [1])]

My second idea was to simply generate n uniform variates in [0, 1], add them together, and take the ratio of each individual variate to the sum.

def randpd2(size):
    variates = [random.random() for i in range(size)]
    s = sum(variates)
    return [i/s for i in variates]

Both of these functions do reliably generate probability distributions, i.e. lists of non-negative real numbers (encoded as Python float objects) that sum to 1, although very large lists generated by randpd2 sometimes sum to something slightly different from 1 due to floating-point imprecision.

>>> sample1 = [randpd1(2**(i//1000 + 1)) for i in range(10000)]
>>> all(all(p >= 0 for p in pd) for pd in sample1)
>>> all(sum(pd) == 1 for pd in sample1)
>>> sample2 = [randpd2(2**(i//1000 + 1)) for i in range(10000)]
>>> all(all(p >= 0 for p in pd) for pd in sample2)
>>> all(sum(pd) == 1 for pd in sample2)
>>> all(math.isclose(sum(pd), 1) for pd in sample2)

But do they both generate a random probability distribution? In precise terms: for a given size argument, are the probability distributions of the return values randpd1(size) and randpd2(size) always uniform?

I don’t really know how to answer this question. In fact, it’s not even clear to me that there is a uniform distribution over the probability distributions of size n for every positive integer n. The problem is that the probability distributions of size n are the solutions in [0, 1]^n of the equation p_1 + p_2 + \dotsb + p_n = 1, where p_1, p_2, … and p_n are dummy variables, and therefore they comprise a set S whose dimension is n − 1 (not n). Because S is missing a dimension, continuous probability distributions over it cannot be defined in the usual way via probability density mappings on \mathbb R^n. Any such mapping would have to assign probability density 0 to every point in S, because for every such point x, there’s a whole additional dimension’s worth of points in every neighbourhood of x which are not in S. But then the integral of the probability density mapping over S would be 0, not 1, and it would not be a probability density mapping.

But perhaps you can map S onto a subset of \mathbb R^{n - 1}, and do something with a uniform distribution over the image. In any case, I’m finding thinking about this very confusing, so I’ll leave it for readers to ponder over. Given that I don’t currently know what a uniform probability distribution over the probability distributions of size n even looks like, I don’t know how to test whether one exists.

I can look at the marginal distributions of the individual items in the returned values of randpd1 and randpd2. But these marginal distributions are not straightforwardly related to the joint distribution of the list as a whole. In particular, uniformity of the joint distribution does not imply uniformity of the marginal distributions, and uniformity of the marginal distributions does not imply uniformity of the joint distribution.

But it’s still interesting to look at the marginal distributions. First of all, they allow validation of another desirable property of the two functions: the marginal distributions are the same for each item (regardless of its position in the list). I’m not going to demonstrate this here because it would be tedious, but it does look like this is the case. Therefore we can speak of “the marginal distribution” without reference to any particular item. Second, they reveal that randpd1 and randpd2 do not do exactly the same thing. The marginal distributions are different for the two functions. Let’s first look just at the case where size is 2.

>>> data1 = [randpd1(2)[0] for i in range(100000)]
>>> plt.hist(data1)


>>> data2 = [randpd2(2)[0] for i in range(100000)]
>>> plt.hist(data2)


The first plot looks like it’s been generated from a uniform distribution over [0, 1]; the second plot looks like it’s been generated from a non-uniform distribution which concentrates the probability density at 1/2. It’s easy to see why the distribution is uniform for randpd1: the function works by generating a single uniform variate p and then returning [min(p, 1 - p), max(p, 1 - p)], and given that the distribution of p is uniform the distribution of 1 - p is also uniform. The function randpd2, on the other hand, works by generating two uniform variates p + q and returning [p/(p + q), q/(p + q)]. However, I don’t know what the distribution of p/(p + q) and q/(p + q) is exactly, given that p and q are uniformly distributed. This is another thing I hope readers who know more about probability and statistics than me might be able to enlighten me on.

Here are the graphs for size 3:

>>> data1 = [randpd1(3)[0] for i in range(100000)]
>>> plt.hist(data1)


>>> data2 = [randpd2(3)[0] for i in range(100000)]
>>> plt.hist(data2)


The marginal distribution for randpd1 is no longer uniform; rather, it’s right triangular, with the right angle at the point (0, 0) and height 2. That means that, roughly speaking, a given item in the list returned by randpd1(3) is twice as likely to be close to 0 as it is to be close to 1/2, and half as likely to be close to 1 as it is to be close to 1/2.

In general, the marginal distribution for randpd1 is the distribution of the minimum of a sample of uniform variates in [0, 1] of size n − 1, where n is the value of size. This is because randpd1 works by generating such a sample, and the minimum of that sample always ends up being the first item in the returned list, and the marginal distributions of the other items are the same as the marginal distribution of the first item.

It turns out to be not too difficult to derive an exact formula for this distribution. For every x \in [0, 1], the minimum is greater than x if and only if all n − 1 variates are greater than x. Therefore the probabilities of these two events are the same. The probability of an individual variate being greater than x is 1 − x (because, given that the variate is uniformly distributed, x is the probability that the variate is less than or equal to x) and therefore, given that the variates are independent of each other, the probability of all being greater than x is (1 - x)^{n - 1}. It follows that the probability of the minimum being less than or equal to x is 1 - (1 - x)^{n - 1}. That is, the cumulative distribution mapping (CDM) f of the marginal distribution for randpd1 is given by

\displaystyle f(x) = 1 - (1 - x)^{n - 1}.

The probability distribution defined by this CDM is a well-known one called the beta distribution with parameters (1, n − 1). That’s a nice result!

The marginal distribution for randpd2, on the other hand, is similar to the one for size 2 except that the mean is now something like 1/3 rather than 1/2, and because the support is still the whole interval [0, 1] this results in a left-skewing of the distribution. Again, I don’t know how to characterize this distribution exactly. Here are the graphs for sizes 4 and 5:

>>> data = [randpd2(4)[0] for i in range(100000)]
>>> plt.hist(data)


>>> data = [randpd2(5)[0] for i in range(100000)]
>>> plt.hist(data2)


It looks like the marginal distribution generally has mean 1/n, or something close to that, for every positive integer n, while still having density approaching 0 at the left limit.

In conclusion… this post doesn’t have a conclusion, it just has a bunch of unanswered questions which I’d like to know the answers to.

  1. Is the concept of a uniform distribution over the probability distributions of size n sensical?
  2. If so, do the returned values of randpd1 and randpd2 have that distribution?
  3. If not, what distributions do they have?
  4. What’s the exact form of the marginal distribution for randpd2?
  5. Which is better: randpd1 or randpd2? Or if one isn’t clearly better than the other, what is each one best suited to being used for?
  6. Are there any other algorithms one could use to generate a random probability distribution?

Modelling communication systems

One of the classes I’m taking this term is about modelling the evolution of communication systems. Everything in the class is done via simulation, which is probably the best way to do it, and certainly necessary at the point where it starts to involve genetic algorithms and such. However, some of the earlier content in the class dealt with problems that I suspected were solvable by a purely mathematical approach, so as somebody with a maths degree I felt it necessary to rise to the challenge and try to derive the solutions mathematically. This post is my attempt to do that.

Let us begin by thinking very abstractly about a system which takes something in and gives something out. Suppose there is a finite, positive number m of things which may be taken in (possible inputs), which we shall call input 1, input 2, … and input m. Suppose likewise that there is a finite, positive number n of things which may be given out (possible outputs), which we shall call output 1, output 2, … and output n.

One way in which the behavior of such a system could be modelled is as a straightforward mapping from inputs to outputs. However, this might be too deterministic: perhaps the system doesn’t always output the same output for a given input. So let’s use a more general model, and think of the system as a mapping from inputs to probability distributions over outputs. For every pair (i, j) of integers such that 0 ≤ im and 0 ≤ jn, let pi, j denote the probability that input i is mapped to output j. The mapping as a whole is determined by the mn probabilities of the form pi, j, and therefore it can be thought of as an m-by-n matrix A:

\displaystyle \mathbf A = \left( \begin{matrix} p_{1, 1} & p_{1, 2} & \hdots & p_{1, n} \\ p_{2, 1} & p_{2, 2} & \hdots & p_{2, n} \\ \vdots & \vdots & \ddots & \vdots \\ p_{m, 1} & p_{m, 2} & \hdots & p_{m, n} \end{matrix} \right).

The rows of A correspond to the possible inputs and the columns of A correspond to the possible outputs. Probabilities are non-negative real numbers, so A is a non-negative real matrix. Also, the probabilities of mutually exclusive, exhaustive outcomes sum to 1, so the sum of each row of A is 1. This condition can be expressed as a system of linear equations:

\displaystyle \begin{aligned} p_{1, 1} &+ p_{1, 2} &+ \hdots &+ p_{1, n} &= 1 \\ p_{2, 1} &+ p_{2, 2} &+ \hdots &+ p_{2, n} &= 1 \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ p_{m, 1} &+ p_{m, 2} &+ \hdots &+ p_{m, n} &= 1. \end{aligned}

Alternatively, and more compactly, it may be expressed as the matrix equation

\displaystyle (1) \quad \mathbf A \mathbf x = \mathbf y,

where x is the n-dimensional vector whose components are all equal to 1 and y is the m-dimensional vector whose components are all equal to 1.

In general, if x is an n-dimensional vector, and we think of x as a random variable determined by the output of the system, then Ax is the vector of expected values of x conditional on each input. That is, for every integer i such that 1 ≤ im, the ith component of Ax is the expected value of x conditional on meaning i being the input to the system.

Accordingly, if we have not just one, but p n-dimensional vectors x1, x2, … and xp (where p is a positive integer), we can think of these p vectors as the columns of an n-by-p matrix B, and then we can read off all the expected values from the matrix product

\displaystyle \mathbf A \mathbf B = \mathbf A \mathbf x_1 + \mathbf A \mathbf x_2 + \dotsb + \mathbf A \mathbf x_n

like so: for every pair (i, k) of integers such that 0 ≤ im and 0 ≤ kp, the (i, k) entry of AB is the expected value of xk conditional on meaning i being the input to the system.

In the case where B happens to be another non-negative real matrix such that

\displaystyle \mathbf B \mathbf x = \mathbf y,

so that the entries of B can be interpreted as probabilities, the matrix B as a whole can be interpreted as another input-output system whose possible inputs happen to be the same as the possible outputs of A. In order to emphasize this identity, let us now call the possible outputs of A (= the possible inputs of B) the signals: signal 1, signal 2, … and signal n. The other things—the possible inputs of A, and the possible outputs of B—can be thought of as meanings. Note that there is no need at the moment for the input meanings (the possible inputs of A) to be the same as the output meanings (the possible outputs of B); we make a distinction between the input meanings and the output meanings.

Together, A and B can be thought of as comprising a “product system” which works like this: an input meaning goes into A, a signal comes out of A, the signal goes into B, and an output meaning comes out of B. For every integer k such that 0 ≤ kp, the random variable xk (the kth column of B) can now be interpreted as the probability of the product system outputting output meaning k, as a random variable whose value is determined by the signal. That is, for every integer j such that 0 ≤ jn, the jth component of xk (the (j, k) entry of B) is the probability of output meaning k coming out if the signal happens to be signal j. It follows by the law of total probability that the probability of output meaning k coming out, if i is the input meaning, is the expected value of xk conditional on i being the input meaning. Now, by what we said a couple of paragraphs above, we have that for every integer i such that 0 ≤ im, the expected value of xk conditional on i being the input meaning is the (i, k) entry of AB. So the “product system”, as a matrix, is the matrix product AB. That’s why we call it the “product system”, see? 🙂

In the case where the possible input meanings are the same as the possible output meanings and m = p, we may think about the “product system” as a communicative dyad. The speaker is A, the hearer is B. The speaker is trying to express a meaning, the input meaning, and producing a signal in order to do so, and the hearer is interpreting that signal to have some meaning, the output meaning. The output meaning the hearer understands is not necessarily the same as the input meaning the speaker was trying to express. If it is different, we may regard the communication as unsuccessful; if it is the same, we may regard the communication as successful.

The key question is: what is the probability that the communication is successful? Given the considerations above, it’s very easy to answer. If the input meaning is i, we’re just looking for the probability that output meaning i given this input meaning. That probability is simply the (i, i) entry of AB, i.e. the ith entry along AB‘s main diagonal.

What if the input meaning isn’t fixed? Then the answer will in general depend on the probability distribution over the possible input meanings. But in the simplest case, where the distribution is uniform (no input meaning is any more probable than any other), the probability of successful communication is just the mean of the input meaning-specific probabilities, that is, the sum of the main diagonal entries of AB, divided by m (the number of the main diagonal entries, i.e. the number of meanings). In linear algebra, we call the sum of the main diagonal entries of a square matrix its trace, and we denote it by tr(C) where C is the matrix. So our formula for the communication success probability p is

\displaystyle (2) \quad p = \frac {\mathrm{tr}(\mathbf A \mathbf B)} m.

If the probability distribution over the input meanings isn’t uniform, the probability of successful communication is just the weighted average of the input meaning-specific probabilities, with the weights being the respective input meaning probabilities. The general formula can therefore be written as

(3) \quad \displaystyle p = \mathrm{tr}(\mathbf A \mathbf B \mathbf D) = \mathrm{tr}(\mathbf D \mathbf A \mathbf B)

where D is the diagonal matrix of size m whose main diagonal is the probability distribution over the input meanings (i.e. for every integer i such that 0 ≤ im, the ith diagonal entry of D is the probability of input meaning i being the one the speaker tries to express). It doesn’t matter whether D is left-multiplied or right-multiplied, because the trace of the product is the same in either case. In the case where the probability distribution over the input meanings is uniform the diagonal entries of D are all equal to 1/m, i.e \mathbf D = \mathbf I_m/m, where Im is the identity matrix of size m, and therefore (3) reduces to (2).

To leave you fully convinced that this formula works, here are some simulations. The 5 graphs below were generated using a Python script which you can view on GitHub. Each one involves 3 possible meanings, 3 possible signals, randomly-generated speaker and hearer matrices and a randomly-generated probability distribution over the input meanings. If you look at the code, you’ll see that the blue line is generated by simulating communication in the obvious way, by randomly drawing an input meaning, randomly drawing a signal based on that particular input meaning, and finally randomly drawing an output meaning based on that particular signal. The position on the x-axis corresponds to the number of trials (individual simulated communicative acts) carried out so far and the position on the y-axis corresponds to the proportion of those trials involving a successful communication (one where the output meaning ended up being the same as the input meaning). For each graph, there were 10 sets of 500 trials; each individual set of trials corresponds to one of the light blue lines, while the darker blue lines gives the results averaged over those ten sets. The horizontal green line indicates the success probability as calculated by our formula. This should be close to the success proportion for a large number of trials, so we should see the blue and green lines converging on the right side of each graph. That is what we see, so the formula works.


The mathematics of surprise, part 1

I’ve split this post into two parts because it would be really long otherwise. Part 2 will be coming up later, hopefully.


Let’s think about how we might define surprisingness as a mathematical quantity.

Surprisingness is a property of events as perceived by observers. After the event occurs, the observer is surprised to a certain degree. The degree of surprisingness depends on how likely the observer thought it was that the event would occur, or to put it briefly the subjective probability of the event. In fact, to a reasonable approximation, at least, the degree of surprisingness depends only on the subjective probability of the event. The greater the subjective probability, the less the surprise.

For example, if you flip a coin and it comes up heads, this is not very surprising—the probability of this event was only 1/2, which is reasonably high. But if you flip ten coins and they all come up heads, you are much more surprised, due to the fact that the probability of ten heads in ten flips is only

\displaystyle \left( \frac 1 2 \right)^{10} = \frac 1 {1024}.

This is a much smaller quantity than 1/2, hence the large increase in surprise. Even if you are unable to do the calculation and work out that the probability of ten heads in ten flips is exactly 1/1024, you are probably aware on an intuitive level that it is a lot smaller than 1/2.

These considerations suggest surprisingness might be definable as a mapping S on [0, 1] (the set of the real numbers between 0 and 1, inclusive), such that for every p \in [0, 1] (i.e. every member p of [0, 1]), the value S(p) is the common surprisingness of the events of subjective probability p. Listed below are three properties such a mapping S should have, if it is to be in accordance with the everyday meaning of “surprise”.

Property 1. For every p \in [0, 1] (i.e. every member p of [0, 1]), the value S(p) should be a non-negative real number, or +\infty (positive infinity).

Justification. Surprisingness seems to be reasonably well-modelled as (to use statistics jargon) an interval variable: two surprisingness values can be compared to see which is less and which is greater, there is a sensical notion of “distance” between surprisingness values, etc. Therefore, it makes sense for surprisingness to be represented by a real number. Moreover, there is a natural minimum surprisingness, namely the surprisingness of an event of subjective probability 1 (i.e. which the observer was sure would happen)—such an event is not at all surprising, and it makes sense to speak of being, e.g., “twice as surprised” by one event compared to another with reference to this natural minimum. To use statistics jargon again, surprisingness is a ratio variable. Naturally, it makes sense to set the minimum value at 0.

Property 2. The mapping S should be strictly decreasing. That is, for every pair p, q of members of [0, 1] such that p < q, we should have S(q) < S(p).

Justification. As said above, events of high subjective probility are less surprising than those of low subjective probability.

Property 3. For every pair p, q of members of [0, 1], we should have S(pq) = S(p) + S(q).

Justification. Suppose A and B are independent events of subjective probabilities p and q respectively. Then, assuming subjective probabilities are assigned in a consistent way, obeying the laws of probability, the subjective probability of A \cap B (the event that both A and B occur) is pq and therefore the surprisingness of A \cap B is S(pq). But given that A and B are independent (so the observation of one does not change the observer’s subjective probability of the other), the surprisingness of A \cap B overall, i.e. S(pq), should be the same as the total surprisingness of the individual observations of A and B, i.e. S(p) + S(q).

Remarkably, these three properties determine the form of S almost exactly (up to a vertical scale factor). Feel free to skip the proof below; it is somewhat long and tedious and doesn’t use any ideas that will be used later on in the post, and I haven’t put in much effort to make it easy to follow.

Theorem 1. The mappings S on [0, 1] having properties 1–3 above are those of the form p : [0, 1] \mapsto -\log_b p (i.e. those mappings S on [0, 1] such that S(p) = - \log_b p for every p \in [0, 1]), where b is a real number greater than 1.

Proof. Suppose S is a mapping on [0, 1] having properties 1–3 above. Let f be the composition of S and x : \mathbb R \mapsto e^x. Then the domain of f is the set of the x \in \mathbb R such that 0 \le e^x \le 1, i.e. x \le 0; and for every pair x, y of non-positive real numbers, we have

\begin{aligned}  f(x, y) &= S(e^{x + y}) \\  &= S(e^x e^y) \\  &= S(e^x) + S(e^y) \\  &= f(x) + f(y).  \end{aligned}

In the case where x = y = 0, we see that f(0 + 0) = f(0) + f(0), and we also have f(0 + 0) = f(0) because 0 + 0 = 0, so combining the two we have f(0) = f(0) + f(0), from which it follows by subtraction of f(0) from both sides that 0 = f(0). Similarly, for every non-positive x \in \mathbb R and every non-negative n \in \mathbb Z (the set \mathbb Z is the set of the integers) we have f((n + 1) x) = f(nx + x) = f(nx) + f(x), and if we assume as an inductive hypothesis that f(nx) = nf(x) it follows that f((n + 1) x) = (n + 1) f(x). This proves by induction that f(nx) = nf(x).

Now, let c = f(-1). For every non-positive x \in \mathbb Q (the set \mathbb Q is the set of the rational numbers), there is a non-positive m \in \mathbb Z and a positive n \in \mathbb Z such that x = m/n, and therefore

\begin{aligned}  nf(x) &= nf \left( \frac m n \right) \\  &= f(m) \\  &= f((-m)(-1)) \\  &= (-m)f(-1) \\  &= -mc,   \end{aligned}

from which it follows that f(x) = -mc/n = -cx.

The equation f(x) = -cx in fact holds for every non-positive x \in \mathbb R, not just for x \in \mathbb Q. To prove this, first, observe that f is strictly decreasing, because for every pair x, y of non-positive real numbers such that x < y, we have e^x < e^y and therefore f(y) = S(e^y) < S(e^x) = f(x) (the mapping S being strictly decreasing by property 2). Second, suppose x is a non-positive real number and f(x) \ne -cx. Then -f(x)/c \ne x, i.e. either -f(x)/c < x (case 1) or x < -f(x)/c (case 2). Because every non-empty interval contains a rational number, it follows that there is an a \in \mathbb Q such that -f(x)/c < a < x (in case 1) or x < a < -f(x)/c (in case 2).

In both cases, we have a \le 0. This is obvious in case 1, because x \le 0. As for case 2, observe first that S is non-negative-valued and therefore -f(x) \le 0. If c is positive, it will then follow that -f(x)/c \le 0 and therefore a \le 0 (because a < -f(x)/c). To prove that c is positive, observe that S is strictly decreasing, and we have f(1) = c, i.e. S(1/e) = c, and f(0) = 0, i.e. S(1) = 0.

Because a \le 0 in both cases, we have f(a) = -ca. And by the strict decreasingness of f it follows that either f(x) < -ca (in case 1) or -ca < f(x) (in case 2). Rearranging these two inequalities gives us a < -f(x)/c (in case 1) or -f(x)/c < a (in case 2). But we already have -f(x)/c < a (in case 1) or a < -f(x)/c (in case 2), so in both cases there is a contradiction. It follows that f(x) \ne -cx cannot hold; we must have f(x) = -cx.

Finishing off, note that for every p \in [0, 1], we have S(p) = S(e^{\log p}) = f(\log p) = -c \log p. Let b = e^{1/c}; then b > 1 because c > 0, and \log b = 1/c, i.e. c = 1/(\log b), from which it follows that S(p) = -(\log p)/(\log b) = -\log_b p for every p \in [0, 1]. From this we can conclude that for every mapping S on [0, 1] such that properties 1–3 hold, we have

\displaystyle S = p : [0, 1] \mapsto -\log_b p,

where b is a real number greater than 1. \blacksquare

In the rest of this post, suppose b is a real number greater than 1. The surprisingness mapping S will be defined by

\displaystyle S = p : [0, 1] \mapsto -\log_b p.

It doesn’t really make any essential difference to the nature of S which value of b is chosen, because for every p \in [0, 1], we have S(p) = -\log_b p = -(\log p)/(\log b) (where the \log symbol with no base specified refers to the natural logarithm, the one to base e), and therefore the choice of base amounts to a choice of vertical scaling. Below are three plots of surprisingness against probability for three different bases (2, e and 10); you can see that the shape of the graph is the same for each base, but the vertical scaling is different.

Note that regardless of the choice of base, we have

\begin{aligned}  S(1) &= -\log_b 1 = -0 = 0, \\  S(0) &= -\log_b 0 = -(-\infty) = \infty.  \end{aligned}

That is, events thought sure to occur are entirely unsurprising, and events thought sure not to occur are infinitely surprising when they do happen.


\displaystyle S \left( \frac 1 b \right) = -\log_b \left( \frac 1 b \right) = -(-1) = 1.

I’m inclined to say on this basis that 2 is the most sensible choice of base, because of all the possible probabilities other than 0 and 1, the one which is most special and thus most deserving of corresponding to the unit surprisingness is probably 1/2, the probability exactly halfway between 0 and 1. However, base e is also quite convenient because it simplifies some of the formulae we’ll see later on in this post.


First, two basic definitions from probability theory (for readers who may not be familiar with them). Suppose A is a finite set.

Definition 1. The probability distributions on A are the positive real-valued mappings f on A such that

\displaystyle (1) \quad \sum_{x \in A} f(x) = 1.

The probability distributions on A can be thought of as random generators of members of A. Each member of A is generated with probability equal to the positive real number less than or equal to 1 associated with that member. The probabilities of generation for each member have to add up to 1, and this is expressed by equation (1). Note that the requirement that the probabilities add up to 1 implies the requirement that the probabilities are each less than or equal to 1, so there is no need to explicitly state the latter requirement in the definition.

Henceforth, suppose f is a probability distribution on A.

Definition 2. For every real-valued mapping \tau on A, the expected value or mean \mathrm E_f(\tau) of the transformation of f by \tau is given by the formula

\displaystyle (2) \quad \mathrm E_f(\tau) = \sum_{x \in A} f(x) \tau(x).

Let x_1, x_2, … and x_n be the members of A and let p_1 = f(x_1), p_2 = f(x_2), … and p_n = f(x_n). Then for every real-valued mapping \tau on A, the expected value \mathrm E_f(\tau) is the weighted average of \tau(x_1), \tau(x_2), … and \tau(x_n), with the weights the respective probabilities p_1, p_2, … and p_n.

If a very large number N of values are generated (independently) by f, then the average value under \tau of these values can be calculated as

\displaystyle \frac 1 N \sum_{k = 1}^n f_k \tau(x_k) = \sum_{k = 1}^n \frac {f_k} N \tau(x_k),

where f_1, f_2, … and f_n are the frequencies of the values x_1, x_2, … and x_n in the sample. Given that N is very large it is very likely that the relative frequencies f_1/N, f_2/N, … and f_n/N will be close to p_1, p_2, … and p_n, respectively, and therefore the average will be close to \mathrm E_f(\tau).

Now, the definition of the key concept of this post:

Definition 3. The expected surprisingness or entropy \mathrm H_f (that’s the Greek capital letter eta, not the Latin capital letter H) of f is \mathrm E_f(S \circ f) (here, S \circ f is the composition of S and f, i.e. x : A \mapsto S(f(x))).

The use of the word “entropy” here is closely related to the use of the same word in thermodynamics. However, I won’t go into the connection here (I don’t really know enough about thermodynamics to be able to talk intelligently about it).

Having the concept of entropy to hand gives us a fun new question we can ask about any given probability distribution: what is its entropy, or, more evocatively, how surprising is it?

By (2), we have

\displaystyle (3) \quad \mathrm H_f = \sum_{x \in A} f(x) S(f(x)).

Using the definition of S, this can also be written as

\displaystyle (4) \quad \mathrm H_f = -\sum_{x \in A} f(x) \log_b f(x).

Using logarithm properties, it can even be written as

\displaystyle \quad \mathrm H_f = \log_b \frac 1 {\prod_{x \in A} f(x)^{f(x)}}.

This shows that \mathrm H_f is the logarithm of the reciprocal of

(5) \quad \displaystyle \prod_{x \in A} f(x)^{f(x)},

which is the geometric weighted average of p_1, p_2, … and p_n, with the weights being identical to the values averaged. A geometric weighted average is similar to an ordinary weighted average, except that the values averaged are multiplied rather than added and the weights are exponents rather than factors. The product (5) can therefore be thought of as the “expected probability” of the value generated by f. To the extent that f is likely to generate one of the members of A which has a low individual probability of being generated, the product (5) is small and accordingly \mathrm H_f is large. This may help give you a slightly more concrete idea why \mathrm H_f can be thought of as the expected surprisingness of f.

Note that the value of \mathrm H_f is determined completely by the probabilities p_1, p_2, … and p_n. The values x_1, x_2, … and x_n are irrelevant in and of themselves. This is evident from the formula (3), in which the expression x only appears as a sub-expression of f(x). To understand how \mathrm H_f is determined by these probabilities, it helps to look at the graph of the mapping p : (0, 1] \mapsto p S(p), which I have plotted below. The entropy \mathrm H_f is a sum of exactly n values of this mapping.

It can be seen from the graph that the mapping has a maximum value. Using calculus, we can figure out what this maximum value is and exactly where it is attained.

Theorem 1. The maximum value attained by p : [0, 1] \mapsto p S(p) is 1/b, and this maximum is attained at the point 1/b (and nowhere else).

Proof. The derivative of p : (0, +\infty) \mapsto p (-\log_b p) is p : (0, +\infty) \mapsto -(1 + \log_b p) (by the product rule), which is positive if p < 1/b (because then \log_b p < -1), equal to 0 if p = 1/b (because then \log_b p = -1), and negative if p > 1/b (because then \log_b p > -1). Therefore p : (0, \infty) \mapsto p (-\log_b p) increases in value from the vicinity of 0 to 1/b and decreases in value from 1/b towards +\infty, which means it attains a minimum at 1/b. \blacksquare

Because \mathrm H_f is a sum of n values of p : [0, 1] \mapsto p S(p), we may conclude that the inequality

\displaystyle \mathrm H_f \le \frac n b

always holds. However, this upper bound on the value of H_f can be improved, as we’ll see below. After all, in proving that it holds we’ve made no use of equation (1).

Cross entropy

The concept of cross entropy is a useful generalization of the concept of entropy.

Suppose g another probability distribution on A. Think of f and g as two different models of the random member-of-A generator: f is the right model (or a more right model, if you don’t like the idea of one single right model), and g is an observer’s working model. The probabilities p_1, p_2, … and p_n can be thought of as the real probabilities of generation of x_1, x_2, … and x_n, respectively, while the probabilities q_1 = g(x_1), q_2 = g(x_2), … and q_n = g(x_n) can be thought of as the observer’s subjective probabilities. Although the real probabilities determine what the observer observes, the subjective probabilities are what determine how surprised the observer is by what they observe. Therefore, if the observer calculates entropy by averaging their surprisingness over a large number of observations, they will get something close to

\displaystyle \sum_{k = 1}^n p_k S(q_k),

i.e. \mathrm E_f(S \circ g). This quantity is called the cross entropy from g to f and denoted \mathrm H(f, g). Note that if f = g then \mathrm H(f, g) = \mathrm H(f) = \mathrm H(g); that’s why the concept of cross entropy is a generalization of the concept of entropy. Note also that \mathrm H(f, g) is not necessarily the same as \mathrm H(g, f).

Your intuition should tell you that \mathrm H(f, g) will always be greater than \mathrm H(f), if f \ne g. Why? Think about it: \mathrm H(f, g) is the expected surprisingness if the observer has the wrong model, \mathrm H(f, f) is the expected surprisingness if the observer has the right model. Having the right model should lead to the observer being better able to predict which member of A is generated and thus to the observer being less likely to be surprised.

It is indeed the case that

\displaystyle (6) \quad \mathrm H(f, g) > \mathrm H(f)

if \mathrm f \ne g (which further reassures us that S is a very good mathematical model of the intuitive notion of surprisingness). The inequality (6) is called Gibbs’ inequality. In order to prove it, first, observe that it may be rewritten as

\displaystyle (7) \quad \mathrm H(f, g) - \mathrm H(f) > 0.

Now, the quantity on the left-hand side of (7) has a name of its own: it’s called the Kullback-Leibler divergence from g to f and denoted \mathrm D(f, g). It measures the “penalty” in increased surprisingness the observer gets for having the wrong model; it can also be thought of as a measure of how different the probability distributions f and g are from each other, hence the name “divergence”. As for the “Kullback-Leibler” part, that’s just because mathematicians have come up with lots of different ways of measuring how different two probability distributions are from each other and Kullback and Leibler were the mathematicians who came up with this particular measure. I won’t be referring to any other such measures in this post, however, so whenever I need to refer to Kullback-Leibler divergence again I’ll just refer to it as “divergence”.

So Gibb’s inequality, reformulated, states that the divergence between two unequal probability distributions is always positive. To prove this, it’s helpful to first write out an explicit expression for \mathrm D(f, g):

\displaystyle \begin{aligned}  \mathrm D(f, g) &= \mathrm H(f, g) - \mathrm H(f) \\  &= \sum_{x \in A} f(x) S(g(x)) - \sum_{x \in A} f(x) S(f(x)) \\  &= \sum_{x \in A} f(x) (S(g(x)) - S(f(x)) \\  &= \sum_{x \in A} f(x) (\log_b f(x) - \log_b g(x)) \\  &= \sum_{x \in A} f(x) \log_b \frac {f(x)} {g(x)}.  \end{aligned}

Second, we prove a lemma.

Lemma 1. For every positive x \in \mathbb R, we have \log_b x \ge (x - 1)/(x \log b), where \log b is the natural logarithm (logarithm to base e) of b, with equality if and only if x = 1.

Proof. The inequality in question is equivalent to \log x \ge (x - 1)/x (by multiplication of both sides by \log b). The right-hand side of this inequality is equal to 1 - 1/x. Consider the mapping x : (0, +\infty) \mapsto \log x - (1 - 1/x). The derivative of this mapping is x : (0, +\infty) \mapsto 1/x - 1/x^2, which is negative if x < 1 (because then x^2 < x and therefore 1/x < 1/x^2), equal to 0 if x = 1 (because then x^2 = x) and positive if x > 1 (because then x^2 > x and therefore 1/x > 1/x^2). Therefore x : (0, +\infty) \mapsto \log x - (1 - 1/x) is strictly decreasing on (0, 1) and strictly increasing on (1, +\infty). It follows that its minimum value is attained at the point 1. And that minimum value is \log x - (1 - 1/x) = 0 - (1 - 1) = 0 - 0 = 0. \blacksquare

Using Lemma 1, we have

\displaystyle (8) \quad \log_b \frac {f(x)} {g(x)} \ge \frac {f(x)/g(x) - 1} {(f(x)/g(x)) \log b} = \frac {f(x) - g(x)} {f(x) \log b}

for every x \in A, with equality if and only if f(x)/g(x) = 1, i.e. f(x) = g(x). Given that f \ne g, there is at least one x \in A such that f(x) \ne g(x) and therefore (8) holds without equality. It follows that

\begin{aligned}  \mathrm D(f, g) &> \sum_{x \in A} f(x) \frac {f(x) - g(x)} {f(x) \log b} \\  &= \frac 1 {\log b} \sum_{x \in A} (f(x) - g(x)) \\  &= \frac 1 {\log b} \left( \sum_{x \in A} f(x) - \sum_{x \in A} g(x) \right) \\  &= \frac {1 - 1} {\log b} \\  &= 0,  \end{aligned}

which proves Gibbs’ inequality.

Gibbs’ inequality is quite powerful and useful. For example, it can be used to figure out what the maximum possible entropy is. Suppose g is such that \mathrm H(f, g) = \mathrm H(g), regardless of the value of f. Then if f \ne g, we have \mathrm H(g) > \mathrm H(f) by Gibbs’ inequality, and therefore \mathrm H(g) is the maximum entropy possible for probability distributions on A and g is the unique probability distribution on A whose entropy is \mathrm H(g). Is there a probability distribution g on A such that \mathrm H(f, g) = \mathrm H(g), regardless of the value of f? There is indeed. Let g be the uniform distribution on A, i.e. the mapping x : A \mapsto 1/n (remember, n is the number of members A has). Then

\displaystyle \sum_{x \in A} g(x) = \frac n n = 1

so g is a probability distribution, and regardless of the value of f we have

\begin{aligned}  \mathrm H(f, g) &= \sum_{x \in A} f(x) \left(-\log_b \frac 1 n \right) \\  &= \left( \sum_{x \in A} f(x) \right) \log_b n \\  &= \log_b n.  \end{aligned}

Therefore, the maximum entropy possible on A is \log_b n, and the uniform distribution on A is the probability distribution which attains this maximum.

A consequence of this is that the divergence of f from the uniform distribution on A is given by

\displaystyle \mathrm D(f, g) = \mathrm H(f, g) - \mathrm H(f) = \log_b - \mathrm H(f)

which is just the negation of \mathrm H(f) plus a constant \log_b n (well, a constant depending on the size of the set A which f is distributed over). Therefore, among probability distributions on A specifically, entropy can be thought of as a measure of divergence from the uniform distribution. Among probability distributions in general, entropy is a measure of both divergence from the uniform distribution and the size of the distributed-over set.

Philosophical implications

So far, we’ve seen that the informal concept of surprisingness can be formalized mathematically with quite a high degree of success—one might even say a surprising degree of success, which is fitting—and that’s pretty neat. But there are also some deeper philosophical issues related to all this. I’ve avoided talking about them up till now because philosophy is not really my field, but I couldn’t let them go completely unmentioned.

Suppose that you know that one of n values x_1, _2, … and x_n values (where n is a positive integer) will be generated by some probability distribution f, but you don’t know the probabilities; you have absolutely no information about the probability distribution other than the set of values it may generate. What should your subjective probability distribution be? A possible answer to this question is that you shouldn’t have one—you simply don’t know the probabilities, and that’s all that can be said. And that’s reasonable enough, especially if you don’t like the concept of subjective probability or think it’s incoherent (e.g. if you’re a strict frequentist when it comes to interpreting probability). But if you accept that subjective probabilities are things it makes sense to talk about, well, the whole point of subjective probabilities is that they represent states of knowledge under uncertainty, so there’s no reason in principle to avoid having a subjective probability distribution just because this particular state of knowledge is particularly uncertain. Of course this subjective probability distribution may change as you gather more information—think of it as your “best guess” to start with, to be improved later on.

There are some people who’d say that you can choose your initial “best guess” on an essentially arbitrary basis, according to your own personal whims. No matter which subjective probability distribution you choose to start with, it will get better and better at modelling reality as you change it as you gather more information; the initial state isn’t really important. The initial state is of interest only in that if we have some updating mechanism in mind, we’d like to be able to prove that convergence to the real probability distribution will always happen independently of the initial state.

There is another position which can be taken, however, which is that there is in fact a certain objectively best subjective probability distribution to start with. This position is associated with the marquis de Laplace (1749–1827), who wrote a classic text on probability theory, the Théorie analytique des probabilités (Laplace was also a great contributor to many other fields of mathematics, as mathematicians back then tended to be—they didn’t do specialization back then). In Laplace’s opinion, the correct distribution to start with was the uniform distribution. That is, given that we know nothing about the probabilities, assume they are all the same (after all, we have no reason to believe any one is larger than any other). This principle is called the principle of indifference.

The concept of probability as a whole can be based on the idea of the principle of indifference. The idea would be that on some level, any probability distribution is over a set of interchangeable values and therefore uniform. However, we are often interested only in whether one of a particular class of values is generated (not in which particular member of that class) and the probability of interest in that case is the sum of the probabilities of each of the values in that class, which, because the underlying distribution is uniform, can also be expressed as the ratio of the number of values in the class to the total number of values which may be generated. I don’t know how far this is how Laplace thought of probability; I don’t want to be too hasty to attribute views to a 19th-century author which might be out of context or just incorrect (after all, I can’t read French so all my information about Laplace is second-hand).

It’s not hard to argue with the principle of indifference. It’s quite difficult to think of any reasonable justification for it at all. It was indeed attacked in vehement terms by later probability theorists, such as John Venn (1834–1923) (that’s right, the Venn diagram guy). In modern times, however, it has been championed by the statistical physicist E. T. Jaynes (1922–1998), who also came up with an interesting extension of it.

In the mathematical section of this post, we saw that the uniform distribution over x_1, x_2, … and x_n was the one with the most entropy, i.e. the one which is “most surprising on average”. Therefore, a natural generalization of the principle of indifference would be to say that in any circumstance in which a subjective probability distribution must be assumed in a situation of uncertainty, one should assume whichever distribution has the most entropy while still being compatible with what is known. For example, if it is known what the mean is then the assumed distribution should be the one with the most entropy among all distributions having that specific mean. This is called the principle of maximum entropy or MaxEnt for short.

The MaxEnt principle makes sense intuitively, sort of, in a way that the principle of indifference on its own doesn’t. If you don’t know something about the probability distribution, you should expect to be surprised more often by the values it generates than if you do know that something. It’s still on fairly shaky ground, though, and I don’t know how far it is accepted by people other than Jaynes as a normative principle, as opposed to just one way in which you might choose the subjective probability distribution to start with, in competition with other ways of doing so on the basis of strictly pragmatic considerations (which seems to be how a lot of people in the practical applications side of things view it). In any case it gives us a philosophical motivation for examining the mathematical problem of finding the probability distributions that have the most entropy given particular constraints. The mathematical problem is interesting itself, but the philosophical connection makes it more interesting.


In part 2 of this post, I’m going to describe how the famous normal distribution can be characterized as a maximum entropy distribution: namely, if it’s the normal distribution with mean \mu (a real number) and standard deviation \sigma (a positive real number), then it’s the absolutely probability distribution over \mathbb R with the most entropy among all absolutely continuous probability distributions over \mathbb R with mean \mu and standard deviation \sigma. That roughly means that by MaxEnt, if you know nothing about a continuous probability distribution other than that it has a particular mean and a particular standard deviation, your best guess to start with is that it’s normal. You can understand the famous Central Limit Theorem from that perspective: as you add up independent, identically distributed random variables, the mean and variance of the distribution in question will be carried over into the sum (elementary statistical theory tells us that the mean of any sum of random variables is the sum of the individual means, and the variance of any sum of independent random variables is the sum of the individual variances), but every other distinctive property of the distribution is gradually kneaded out by the summing process, so that as the sum gets larger and larger all we can say about the probability distribution of the sum is that it has this particular mean and this particular variance. I intend to finish off part 2 with a proof of the Central Limit theorem from this perspective, although that might be a little ambitious. Before that, though, the other big thing which I need to cover in part 2 is defining entropy in the case of a continuous probability distribution—I’ve only been talking about discrete probability distributions in part 1, and it turns out the extension is not entirely a straightforward matter.

Bonferroni’s inequalities and the inclusion-exclusion principle

A non-negative real- or ∞-valued mapping f on a field \mathcal F of sets is said to be finitely additive if and only if for every pair of disjoint sets A and B in \mathcal F, we have

\displaystyle f(A \cup B) = f(A) + f(B).

The most important examples of finitely additive mappings are measures, including probability measures, although not every finitely additive mapping is a measure (measures are mappings on σ-algebras, which are a special sort of field of sets, that are countably additive, which is a stronger property than finite additivity).

From the definition it is immediately evident that finite additivity allows us to express the value under a mapping f on a field \mathcal F of sets of any binary union of pairwise disjoint sets in \mathcal F in terms of the values under f of the individual sets. In fact, the same can be said for unions of any arity, provided they are pairwise disjoint. For every field \mathcal F of sets, every finitely additive mapping f on \mathcal F, every n \in \mathbb Z_0[1] and every n-tuple (A1, A2, …, An) of pairwise disjoint sets in \mathcal F, we have

\displaystyle (1) \quad f \left( \bigcup_{k = 1}^n A_k \right) = \sum_{k = 1}^n A_k.

This can be proven by induction. In the base case of n = 0, we have

\displaystyle \begin{aligned} f(\emptyset) &= f(\emptyset \cup \emptyset) \\ &= f(\emptyset) + f(\emptyset) \end{aligned}

which implies that 0 = f(∅). Now, suppose m \in \mathbb Z_0 and (1) holds for every (m + 1)-tuple (A1, A2, …, Am + 1) of pairwise disjoint sets in \mathcal F. Then

\displaystyle \begin{aligned} f \left( \bigcup_{k = 1}^{m + 1} A_k \right) &= f \left( \left( \bigcup_{k = 1}^m A_k \right) \cup A_{m + 1} \right) \\ &= f \left( \bigcup_{k = 1}^m A_k \right) + f(A_{m + 1}) \\ &= \sum_{k = 1}^m f(A_k) + f(A_{m + 1}) \\ &= \sum_{k = 1}^{m + 1} f(A_k) \end{aligned}

which completes the proof.

But what about unions of sets in \mathcal F that are not necessarily pairwise disjoint? Can the values under f of such unions be expressed in terms of the values under f of the individual sets then? The answer is no. However, such unions’ values under f can be expressed in terms of the values under f of the individual sets and their intersections, by what is known as the inclusion-exclusion principle. For every n \in \mathbb Z_0 and every (A_1, A_2, \dotsc, A_n) \in \mathcal F^{\times n}, we have

\displaystyle (2) \quad f \left( \bigcup_{k = 1}^n A_k \right) = \sum_{\substack{S \subseteq \mathbb Z_1^n \\ S \ne \emptyset}} (-1)^{\#S + 1} f \left( \bigcap_{k \in S} A_k \right).

The sum on the right-hand side of (2) is a rather complicated one, cumbersome to write down as well as computationally expensive to compute by virtue of its large number of terms (one for every non-empty subset of \mathbb Z_1^n, and there are 2n − 1 of those). Therefore, it is also convenient to use Bonferroni’s inequalities, which say that for every m \in \mathbb Z_0, we have

\displaystyle (3) \quad f \left( \bigcup_{k = 1}^n A_k \right) \gtreqless \sum_{\substack{S \subseteq \mathbb Z_1^n \\ 0 < \#S \le m}} (-1)^{\#S + 1} f \left( \bigcap_{k \in S} n A_k \right),

with the sign \gtreqless standing for “greater than or equal to, if m is even; less than or equal to, if m is odd”. The sum on the right-hand side of (3) has terms only for every non-empty subset of \mathbb Z_1^n with no more than m members, and there are only 2m − 1 of those. In particular, if m = 1 the terms are just the values under f of the individual sets A1, A2, … and An and therefore we have

\displaystyle f \left( \bigcup_{k = 1}^n A_k \right) \le \sum_{k = 1}^n f(A_k),

which is Boole’s inequality.

Note that Bonferroni’s inequalities hold when mn as well as when m < n. When mn, the sum on the RHS of (3) is exactly the same as the sum on the RHs of (2). Because there are both even and odd integers m such that mn, and any quantity which is both less than or equal to and greater than or equal to another quantity has to be equal to it, it follows that Bonferroni’s inequalities imply that the equation (2) holds and thus generalize the inclusion-exclusion principle.

In order to prove that Bonferroni’s inequalities hold, and thus prove the inclusion-exclusion principle, we can use induction. First, consider the case where m = 0. The only subset of \mathbb Z_1^n of cardinality 0 is the empty one, so in this case the sum on the right-hand side of (3) is empty and (3) therefore reduces to the statement that

\displaystyle f \left( \bigcup_{k = 1}^n A_k \right) \ge 0,

which is true because f is non-negative- or ∞-valued.

Now, suppose M \in \mathbb Z_0 and Bonferroni’s inequalities hold in the case where m = M, and consider the case where m = M + 1. We shall use another inductive proof, within the inductive proof we’re currently carrying out, to show that Bonferroni’s inequalities hold for every n \in \mathbb Z_0 in when m has this particular value. In the case where n = 0, the left-hand side of (3) reduces to f(∅) and the right-hand side is the empty sum once again, because there are no subsets of \mathbb Z_1^0 (\mathbb Z_1^0 is the set of the integers greater than or equal to 1 and less than or equal to 0, and there are of course no such integers). Because f(∅) = 0 it follows that (3) is true, regardless of the direction of the inequality required.

As for the successor case, suppose that N \in \mathbb Z_0 and Bonferroni’s inequalities hold in the case where n = N. Consider the case where n = N + 1. It is helpful at this point to write (3) for the given values of m and n as

\displaystyle (-1)^{M + 1} (f(A) - a) \ge 0,


\displaystyle \begin{aligned} A &= \bigcup_{k = 1}^{N + 1} A_k, \\ a &= \sum_{\substack{S \subseteq \mathbb Z_1^{N + 1} \\ 0 < \#S \le M + 1}} (-1)^{\#S + 1} f \left( \bigcap_{k \in S} A_k \right). \end{aligned}

If we also let

\displaystyle \begin{aligned} B &= \bigcup_{k = 1}^N A_k, \\ b &= \sum_{\substack{S \subseteq \mathbb Z_1^N \\ 0 < \#S \le M + 1}} (-1)^{\#S + 1} f \left( \bigcap_{k \in S} A_k \right), \end{aligned}

then we have (−1)M + 1(f(B) − b) ≥ 0 because Bonferroni’s inequalities hold in the case where n = N. Now, how are f(A) and a related to f(B) and b?

We obviously have A = BAN + 1, but B and AN + 1 are not necessarily disjoint so this doesn’t immediately tell us anything about the relationship of f(A) and f(B). However, AN + 1 is certainly disjoint from B \ AN + 1, so we have

(4) \quad \displaystyle f(A) = f(B \setminus A_{N + 1}) + f(A_{N + 1}).

That’s about all we can usefully say for the moment. As for a and b, well, the terms of b are a subset of those of a so it’s quite easy to write down the difference ab. If we manipulate that difference a little bit, we can start getting it to look like something that could occur on the right-hand side of (3).

\displaystyle \begin{aligned} a - b &= \sum_{\substack{S \subseteq \mathbb Z_1^{N + 1} \\ N + 1 \in S \\ \#S \le M + 1}} (-1)^{\#S + 1} f \left( \bigcap_{k \in S} A_k \right) \\ &= \sum_{\substack{S \subseteq \mathbb Z_1^N \\ \#S \le M}} (-1)^{\#S + 2} f \left( \left( \bigcap_{k \in S} A_k \right) \cap A_{N + 1} \right) \\ &= f(A_{N + 1}) - \sum_{\substack{S \subseteq \mathbb Z_1^N \\ 0 < \#S \le M}} (-1)^{\#S + 1} f \left( \bigcap_{k \in S} A_k \cap A_{N + 1} \right) \end{aligned}


\displaystyle \begin{aligned} C &= \bigcup_{k = 1}^N A_k \cap A_{N + 1}, \\ c &= \sum_{\substack{S \subseteq \mathbb Z_1^N \\ 0 < \#S \le M}} (-1)^{\#S + 1} f \left( \bigcap_{k \in S} A_k \cap A_{N + 1} \right). \end{aligned}

Then we have (−1)M(f(C) − c) ≥ 0, because Bonferroni’s inequalities hold in the case where m = M, and we have ab = f(AN + 1) − c.

Now, if we use the distributivity of intersection over union we can rewrite C as BAN + 1. It follows that B is the disjoint union of C and the set B \ AN + 1 which turned up above when we were contemplating the relationship of f(A) and f(B), and therefore we have f(B) = f(C) + f(B \ AN + 1). Using this new equation we may rewrite (4) as

(4) \quad \displaystyle f(A) = f(B) - f(C) + f(A_{N + 1}),

from which it follows that f(A) − f(B) = f(AN + 1) − f(C)—a nicely analogous equation to ab = f(AN + 1) − c. Finally, let us add the two quantities (−1)M + 1(f(B) − b) and (−1)M(f(C) − c) ≥ 0 which we know to be non-negative. The sum of two non-negative quantities is non-negative also, so we have

\displaystyle \begin{aligned} 0 &\le (-1)^{M + 1} (f(B) - b) + (-1)^M (f(C) - c) \\ &= (-1)^{M + 1} (f(B) - b - (f(C) - c)) \\ &= (-1)^{M + 1} (f(B) - f(C) - (b - c)) \\ &= (-1)^{M + 1} (f(A) - f(A_{N + 1}) - (a - f(A_{N + 1}))) \\ &= (-1)^{M + 1} (f(A) - a). \end{aligned}


[1] For every (a, b) \in \mathbb Z^{\times 2}, I denote \{n \in \mathbb Z : a \le n \le b\}, \{n \in \mathbb Z : a \le n\} and \{n \in \mathbb Z : n \le b\} by \mathbb Z_a^b, \mathbb Z_a and \mathbb Z^b respectively.

A very simple stochastic model of diachronic change

1. The discrete process

1.1. The problem

Consider an entity (for example, a language) which may or may not have a particular property (for example, obligatory coding of grammatical number). For convenience and interpretation-neutrality, we shall say that the entity is positive if it has this property and negative if it does not have this property. Consider the entity as it changes over the course of a number of events (for example, transmissions of the language from one generation to another) in which the entity’s state (whether it is positive or negative) may or may not change. For every nonnegative integer {n}, let {X_n} represent the entity’s state after exactly {n} events have occurred, with negativity being represented by 0 and positivity being represented by 1. The initial state {X_0} is a constant parameter of the model, but the states at other times are random variable whose “success” probabilities (i.e. values of 1 under their probability mass functions) are determined by {X_0} and the other parameters of the model.

The other parameters of the model, besides {X_0}, are denoted by {p} and {q}. These represent the probabilities that an event will change the state from negative to positive or from positive to negative, respectively. They are assumed to be constant across events—this assumption can be thought of as an interpretation of the uniformitarian principle familiar from historical linguistics and other fields. I shall call a change of state from negative to positive a gain and a change of state from positive to negative a loss, so that {p} can be thought of as the gain rate per event and {q} can be thought of as the loss rate per event.

Note that the gain resp. loss probability is {p}/{q} only if the state is negative resp. positive as the event begins. If the state is already positive resp. negative as the event begins then it is impossible for a further gain resp. loss to occur and therefore the gain resp. loss probability is 0 (but the loss resp. gain probability is {q}/{p}). Thus the random variables {X_1}, {X_2}, {X_3}, … are not necessarily independent of one another.

I am aware that there’s a name for a sequence of random variables that are not necessarily independent of one another, namely “stochastic process”. However, that is about the extent of what I know about stochastic processes. I think the thing I’m talking about in this post is a very simple example of a stochastic process–an appropriate name for it would be the gain-loss process. If you know something about stochastic processes it might seem very trivial, but it was an interesting problem for me to try to figure out knowing nothing already about stochastic processes.

1.2. The solution

Suppose {n} is a nonnegative integer and consider the state {X_{n + 1}} after exactly {n + 1} events have occurred. If the entity is negative as the {(n + 1)}th event begins, the probability of gain during the {(n + 1)}th event is {p}. If the entity is positive as the {(n + 1)}th event begins, the probability of loss during the {(n + 1)}th event is {q}. Now, as the {(n + 1)}th event begins, exactly {n} events have already occurred. Therefore the probability that the entity is negative as the {(n + 1)}th event begins is {\mathrm P(X_n = 0)} and the probability that the entity is positive as the {(n + 1)}th event begins is {\mathrm P(X_n = 1)}. It follows by the law of total probability that

\displaystyle \begin{aligned} \mathrm P(X_{n + 1} = 1) &= p (1 - \mathrm P(X_n = 1)) + (1 - q) \mathrm P(X_n = 1) \\ &= p - p \mathrm P(X_n = 1) + \mathrm P(X_n = 1) - q \mathrm P(X_n = 1) \\ &= p - (p - 1 + q) \mathrm P(X_n = 1) \\ &= p + (1 - p - q) \mathrm P(X_n = 1). \end{aligned}

This recurrence relation can be solved using the highly sophisticated method of “use it to find general equations for the first few terms in the sequence, extrapolate the pattern, and confirm that the extrapolation is valid using a proof by induction”. I’ll spare you the laborious first phrase, and just show you the second and third. The solution is

\displaystyle \begin{aligned} \mathrm P(X_n = 1 | X_0 = 0) &= p \sum_{i = 0}^{n - 1} (1 - p - q)^i, \\ \mathrm P(X_n = 1 | X_0 = 1) &= 1 - q \sum_{i = 0}^{n - 1} (1 - p - q)^i. \end{aligned}

Just so you can check that this is correct, the proofs by induction for the separate cases are given below.

Case 1 ({X_0 = 0)}. Base case. The expression

\displaystyle p \sum_{i = 0}^{n - 1} (1 - p - q)^i

evaluates to 0 if {n = 0}, because the sum is empty.

Successor case. For every nonnegative integer {n} such that

\displaystyle \mathrm P(X_n = 1 | X_0 = 0) = p \sum_{i = 0}^{n - 1} (1 - p - q)^i,

we have

\displaystyle \begin{aligned} \mathrm P(X_{n + 1} = 1 | X_0 = 0) &= p + (1 - p + q) \mathrm P(X_n = 1 | X_0 = 0) \\ &= p + (1 - p - q) p \sum_{i = 0}^{n - 1} (1 - p - q)^i \\ &= p + p (1 - p - q) \sum_{i = 0}^{n - 1} (1 - p - q)^i \\ &= p \left( 1 + \sum_{i = 0}^{n - 1} (1 - p - q)^{i + 1} \right) \\ &= p \sum_{j = 0}^n (1 - p - q)^j. \end{aligned}

Case 2 ({X_0 = 1}). Base case. The expression

\displaystyle 1 - q \sum_{i = 0}^{n - 1} (1 - p - q)^i

evaluates to 1 if {n = 0}, because the sum is empty.

Successor case. For every nonnegative integer {n} such that

\displaystyle \mathrm P(X_n = 1 | X_0 = 1) = 1 - q \sum_{i = 0}^{n - 1} (1 - p - q)^i,

we have

\displaystyle \begin{aligned} \mathrm P(X_{n + 1} = 1 | X_0 = 1) &= p + (1 - p + q) \mathrm P(X_n = 1 | X_0 = 1) \\ &= p + (1 - p - q) \left( 1 - q \sum_{i = 0}^{n - 1} (1 - p - q)^i \right) \\ &= p + 1 - p - q - (1 - p - q) q \sum_{i = 0}^{n - 1} (1 - p - q)^i \\ &= 1 - q - q (1 - p - q) \sum_{i = 0}^{n - 1} (1 - p - q)^i \\ &= 1 - q \left( 1 + \sum_{i = 0}^{n - 1} (1 - p - q)^{i + 1} \right) \\ &= 1 - q \sum_{j = 0}^n (1 - p - q)^j. \end{aligned}

I don’t know if there is any way to make sense of why exactly these equations are the way they are; if you have any ideas, I’d be interested to hear your comments. There is a nice way I can see of understanding the difference between the two cases. Consider an additional gain-loss process {B} which changes in tandem with the gain-loss process {A} that we’ve been considering up till just now, so that its state is always the opposite of that of {A}. Then the gain rate of {B} is {q} (because if {B} gains, {A} loses) and the lose rate of {B} is {p} (because if {B} loses, {A} gains). And for every nonnegative integer {n}, if we let {Y_n} denote the state of {B} after exactly {n} events have occurred, then

\displaystyle \mathrm P(Y_n = 1) = 1 - \mathrm P(X_n = 1)

because {Y_n = 1} if and only if {X_n = 0}. Of course, we can also rearrange this equation as {\mathrm P(X_n = 1) = 1 - \mathrm P(Y_n = 1)}.

Now, we can use the equation for Case 1 above, but with the appropriate variable names for {B} substituted in, to see that

\displaystyle \mathrm P(Y_n = 1 | Y_0 = 0) = q \sum_{i = 0}^{n - 1} (1 - q - p)^i,

and it then follows that

\displaystyle \mathrm P(X_n = 1 | X_0 = 1) = 1 - q \sum_{i = 0}^{n - 1} (1 - p - q)^i.

Anyway, you may have noticed that the sum

\displaystyle \sum_{i = 0}^{n - 1} (1 - p - q)^i

which appears in both of the equations for {\mathrm P(X_n = 1)} is a geometric progression whose common ratio is {1 - p - q}. If {1 - p - q = 1}, then {p + q = 0} and therefore {p = q = 0} (because {p} and {q} are probabilities, and therefore non-negative). The probability {\mathrm P(X_n = 1)} is then simply constant at 0 if {X_0 = 0} (because gain is impossible) and constant at 1 if {X_0 = 1} (because loss is impossible). Outside of this very trivial case, we have {1 - p - q \ne 1}, and therefore the geometric progression may be written as a fraction as per the well-known formula:

\displaystyle \begin{aligned} \sum_{i = 0}^{n - 1} (1 - p - q)^i &= \frac {1 - (1 - p - q)^n} {1 - (1 - p - q)} \\ &= \frac {1 - (1 - p - q)^n} {p + q}. \end{aligned}

It follows that

\displaystyle \begin{aligned} \mathrm P(X_n = 1 | X_0 = 0) &= \frac {p (1 - (1 - p - q)^n)} {p + q}, \\ \mathrm P(X_n = 1 | X_0 = 1) &= 1 - \frac {q (1 - (1 - p - q)^n)} {p + q} \\ &= \frac {p + q - q (1 - (1 - p - q)^n)} {p + q} \\ &= \frac {p + q - q + q (1 - p - q)^n} {p + q} \\ &= \frac {p + q (1 - p - q)^n} {p + q}. \end{aligned}

From these equations it is easy to see the limiting behaviour of the gain-loss process as the number of events approaches {\infty}. If {1 - p - q = -1}, then {p + q = 2} and therefore {p = q = 1} (because {p} and {q} are probabilities, and therefore not greater than 1). The equations in this case reduce to

\displaystyle \begin{aligned} \mathrm P(X_n = 1 | X_0 = 0) &= \frac {1 - (-1)^n} 2, \\ \mathrm P(X_n = 1 | X_0 = 1) &= \frac {1 + (-1)^n} 2, \end{aligned}

which show that the state simply alternates deterministically back and forth between positive and negative (because {(1 - (-1)^n)/2} is 0 if {n} is even and 1 if {n} is odd and {(1 + (-1)^n)/2} is 1 if {n} is even and 0 if {n} is odd).

Otherwise, we have {|1 - p - q| < 1} and therefore

\displaystyle \lim_{n \rightarrow \infty} (1 - p - q)^n = 0.

Now the equations for {\mathrm P(X_n = 1 | X_0 = 0)} and {\mathrm P(X_n = 1 | X_0 = 1)} above are the same apart from the term in the numerator which contains {(1 - p - q)^n} as a factor, as well as another factor which is independent of {n}. Therefore, regardless of the value of {X_0},

\displaystyle \lim_{k \rightarrow \infty} \mathrm P(X_k = 1) = \frac p {p + q}.

This is a nice result: if {n} is sufficiently large, the dependence of {X_n} on {X_0}, {X_1}, … and {X_{n - 1}} is negligible and its success probability is negligibly different from {p/(p + q)}. That it is this exact quantity sort of makes sense: it’s the ratio of the gain rate to the theoretical rate of change of state in either direction that we would get if both a gain and loss could occur in a single event.

In case you like graphs, here’s a graph of the process with {X_0 = 0}, {p = 1/100}, {q = 1/50} and 500 events. The x-axis is the number of events that have occurred and the y-axis is the observed frequency, divided by 1000, of the state being positive after this number of events has occurred (for the blue line) or the probability of the state being positive according to the equations described in this post (for the green line). If you want to, you can view the Python code that I used to generate this graph (which is actually capable of simulating multiple-trait interactions, although I haven’t tried solving it in that case) on GitHub.


2. The continuous process

2.1. The problem

Let us now consider the same process, but continuous rather than discrete. That is, rather than the gains and losses occuring over the course of a discrete sequence of events, we now have a continuous interval in time, during which at any point losses and gains might occur instantaneously. The state of the process at time {t} shall be denoted {X(t)}. Although multiple gains and losses may occur during an arbitrary subinterval, we may assume for the purpose of approximation that during sufficiently short subintervals only one gain or loss, or none, may occur, and the probabilities of gain and loss are directly proportional to the length of the subinterval. Let {\lambda} be the constant of proportionality for gain and let {\mu} be the constant of proportionality for loss. These are the continuous model’s analogues of the {p} and {q} parameters in the discrete model. Note that they may be greater than 1, unlike {p} and {q}.

2.2. The solution

Suppose {t} is a non-negative real number and {n} is a positive integer. Let {\Delta t = 1/n}. The interval in time from time 0 to time {t} can be divided up into {n} subintervals of length {\Delta t}. If {\Delta t} is small enough, so that the approximating assumptions described in the previous paragraph can be made, then the subintervals can be regarded as discrete events, during each of which gain occurs with probability {\lambda \Delta t} if the state at the start point of the subinterval is negative and loss occurs with probability {\mu \Delta t} if the state at the start point of the subinterval is positive. For every positive integer {k} between 0 and {n} inclusive, let {Y_k} denote the state of this discrete approximation of the process at time {t + k \Delta t}. Then for every integer {k} between 0 and {n} (inclusive) we have

\displaystyle \begin{aligned} \mathrm P(Y_k = 1 | Y_0 = 0) &= \frac {\lambda \Delta t (1 - (1 - \lambda \Delta t - \mu \Delta t)^k)} {\lambda \Delta t + \mu \Delta t}, \\ \mathrm P(Y_k = 1 | Y_0 = 1) &= \frac {\lambda \Delta t + \mu \Delta t (1 - \lambda \Delta t - \mu \Delta t)^k} {\lambda \Delta t + \mu \Delta t}, \end{aligned}

provided {\lambda} and {\mu} are not both equal to 0 (in which case, just as in the discrete case, the state remains constant at whatever the initial state was).

Many of the {\Delta t} factors in this equation can be cancelled out, giving us

\displaystyle \begin{aligned} \mathrm P(Y_k = 1 | Y_0 = 0) &= \frac {\lambda (1 - (1 - (\lambda + \mu) \Delta t)^k)} {\lambda + \mu}, \\ \mathrm P(Y_k = 1 | Y_0 = 1) &= \frac {\lambda + \mu (1 - (\lambda + \mu) \Delta t)^k} {\lambda + \mu}. \end{aligned}

Now consider the case where {k = n} in the limit {n} approaches {\infty}. Note that {\Delta t} approaches 0 at the same time, because {\Delta t = t/n}, and therefore the limit of {(1 - (\lambda + \mu) \Delta t)^n} is not simply 0 as in the discrete case. If we rewrite the expression as

\displaystyle \left( 1 - \frac {t (\lambda + \mu)} n \right)^n

and make the substitution {n = -mt(\lambda + \mu)}, giving us

\displaystyle \left( 1 + \frac 1 m \right)^{-mt(\lambda + \mu)} = \left( \left( 1 + \frac 1 m \right)^m \right)^{-t(\lambda + \mu)},

then we see that the limit is in fact {e^{-t(\lambda + \mu)}}, an exponential function of {t}. It follows that

\displaystyle \begin{aligned} \mathrm P(X(t) = 1 | X(0) = 0) = \lim_{n \rightarrow \infty} \mathrm P(Y_n = 1 | Y_0 = 0) &= \frac {\lambda (1 - e^{-t(\lambda + \mu)})} {\lambda + \mu}, \\ \mathrm P(X(t) = 1 | X(0) = 1) = \lim_{n \rightarrow \infty} \mathrm P(Y_n = 1 | Y_0 = 1) &= \frac {\lambda + \mu e^{-t(\lambda + \mu)}} {\lambda + \mu}. \end{aligned}

This is a pretty interesting result. I initially thought that the continuous process would just have the solution {\mathrm P(X_n = 1) = \lambda/{\lambda + \mu}}, completely independent of {X_0} and {t}, based on the idea that it could be viewed as a discrete process with an infinitely large number of events within every interval of time, so that it would constantly behave like the discrete process does in the limit as the number of events approaches infinity. In fact it turns out that it still behaves like the discrete process, with the effect of the initial state never quite disappearing—although it does of course disappear in the limit as {t} approaches {\infty}, because {e^{-t(\lambda + \mu)}} approaches 0:

\displaystyle \lim_{t \rightarrow \infty} \mathrm P(X(t) = 1) = \frac {\lambda} {\lambda + \mu}.

Greenberg’s Universal 38 and its diachronic implications

This post is being written hastily and therefore may be more incomprehensible than usual.

Greenberg’s Universal 38 says:

Where there is a case system, the only case which ever has only zero allomorphs is the one which includes among its meanings that of the subject of the intransitive verb. (Greenberg, 1963, 59)

A slightly better way to put this (more clearly clarifying what the universal says about languages that code case by means of non-concatenative morphology) might be: if a language makes a distinction between nominative/absolutive and ergative/accusative case by means of concatenative morphology, then there is always at least one ergative/accusative suffix form with nonzero phonological substance. Roughly, there’s a preference for there to be an ergative/accusative affix rather than a nominative/absolutive affix (but it’s OK if there are phonologically substantive affixes for both cases, or if ergative/accusative is zero-coded in some but not all environments).

On the other hand, Greenberg’s statement of the universal makes clear a rather interesting property of it: if you’re thinking about which argument can be zero-coded in a transitive sentence, Universal 38 actually says that it depends on what happens in an intransitive sentence: the one which can be zero-coded is the one which takes the same case that arguments in intransitive sentences take. If the language is accusative, then the nominative, the agenty argument, can be zero-coded, and the accusative, the patienty argument, can’t. If the language is ergative, then the absolutive, the patienty argument, can be zero-coded, and the ergative argument, can’t. (I mean can’t as in can’t be zero-coded in all environments.)

This is a problem, perhaps, for those who think of overt coding preferences and other phenomena related to “markedness” (see Haspelmath, 2006, for a good discussion of the meaning of markedness in linguistics) as related to the semantics of the category values in question. Agenty vs. patienty is the semantic classification of the arguments, but depending on the morphosyntactic alignment of the language, it can be either the agenty or patienty arguments which are allowed to be zero-coded. This seems like a case where Haspelmath’s preferred explanation of all phenomena related to markedness—differences in usage frequency—is much more preferable, although I don’t think he mentions it in his paper (but I might have missed it—I’m not going to check, because I’m trying to not spend too long writing this post).

Anyway, one thing I wonder about this universal (and a thing it’s generally interesting to wonder about with respect to any universal) is how it’s diachronically preserved. For it’s quite easy to imagine ways in which a language could end up in a situation where it has a zero-coded nominative/absolutive due to natural changes. Let’s say it has both cases overtly coded to start with; let’s say the nominative suffix is -ak and the accusative suffix is -an. Now final -n gets lost, with compensatory nasalization, and then vowels in absolute word-final position get elided. (That’s a perfectly natural sequence of sound changes; it happened in the history of English, cf. Proto-Indo-European *yugóm > Proto-Germanic *juką > English yoke.) The language would then end up with nominative -ak and a zero-coded accusative, thus violating Universal 38. So… well, I don’t actually know how absolute Universal 38 is, perhaps it has some exceptions (though I don’t know of any), and if there are enough exceptions we might be able to just say that it’s these kinds of developments that are responsible for the exceptions. But if the exceptions are very few, then there’s probably some way in which languages which end up with zero-coded accusatives like this are hastily “corrected” to keep them in line with the universal. Otherwise we’d expect to see more exceptions. Here’s one interesting question: how would that correction happen? It could just be that a postposition gets re-accreted or something and the accusative ends up being overtly coded once again. But it could also be that subjects of intransitive sentences start not getting the -ak suffix added to them, so that you get a shift from accusative to ergative morphosyntactic alignment, with the zero-coded accusative becoming a perfectly Universal 38-condoned zero-coded absolutive. That’d be pretty cool: a shift in morphosyntactic alignment triggered simply by a coincidence of sound change. Is any such development attested? Somebody should have it happen in a conlang family.

According to Wichmann (2009), morphosyntactic alignment is a “stable” feature which might be a problem if alignment shifts can occur in the manner described above. But then again, I wonder how common overt coding of both nominative/absolutive and ergative/accusative is, actually—most Indo-European languages that mark cases have it, but I did a quick survey of some non-IE languages with case marking, both accusative (Finnish, Hungarian, Turkish, Tamil, Quechua) and ergative (Basque, Dyirbal) and they all seem to code nominative/absolutive by zero (well, Basque codes absolutive overtly in one of its declensions, but not in the other two). If it’s pretty rare for both to be overtly coded, then this correction doesn’t have to happen very often, but it would surely need to happen sometimes if Universal 38 is absolute or close to it.


Greenberg, J. H., 1963. Some universals of grammar with particular reference to the order of meaningful elements. In Greenberg, J. H. (ed.), Universals of Language, 73–113. MIT Press.

Haspelmath, M. (2006). Against markedness (and what to replace it with). Journal of linguistics, 42(01), 25–70.

Wichmann, S. & Holman, E. W. (2009). Temporal stability of linguistic typological features. Retrieved from https://www.academia.edu/13245711/Temporal_Stability_of_Linguistic_Typological_Features.

That’s OK, but this’s not OK?

Here’s something peculiar I noticed the other day about the English language.

The word is (the third-person singular present indicative form of the verb be) can be ‘contracted’ with a preceding noun phrase, so that it is reduced to an enclitic form -‘s. This can happen after pretty much any noun phrase, no matter how syntactically complex:

(1) he’s here

/(h)iːz ˈhiːə/[1]

(2) everyone’s here

/ˈevriːwɒnz ˈhiːə/

(3) ten years ago’s a long time

/ˈtɛn ˈjiːəz əˈgəwz ə ˈlɒng ˈtajm/

However, one place where this contraction can’t happen is immediately after the proximal demonstrative this. This is strange, because it can certainly happen after the distal demonstrative that, and one wouldn’t expect these two very similar words to behave so differently:

(4) that’s funny
/ˈðats ˈfʊniː/

(5) *this’s funny

There is a complication here which I’ve kind of skirted over, though. Sure, this’s funny is unacceptable in writing. But what would it sound like, if it was said in speech? Well, the -’s enclitic form of is can actually be realized on the surface in a couple of different ways, depending on the phonological environment. You might already have noticed that it’s /-s/ in example (4), but /-z/ in examples (1)-(3). This allomorphy (variation in phonological form) is reminiscent of the allomorphy in the plural suffix: cats is /ˈkats/, dogs is /ˈdɒgz/, horses is /ˈhɔːsɪz/. In fact the distribution of the /-s/ and /-z/ realizations of -‘s is exactly the same as for the plural suffix: /-s/ appears after voiceless non-sibilant consonants and /-z/ appears after vowels and voiced non-sibilant consonants. The remaining environment, the environment after sibilants, is the environment in which the plural suffix appears as /-ɪz/. And this environment turns out to be exactly the same environment in which -’s is unacceptable in writing. Here are a couple more examples:

(6) *a good guess’s worth something (compare: the correct answer’s worth something)

(7) *The Clash’s my favourite band (compare: Pearl Jam’s my favourite band)

Now, if -‘s obeys the same rules as the plural suffix then we’d expect it to be realized as /-ɪz/ in this environment. However… this is exactly the same sequence of segments that the independent word is is realized as when it is unstressed. One might therefore suspect that in sentences like (8) below, the morpheme graphically represented as the independent word is is actually the enclitic -‘s, it just happens to be realized the same as the independent word is and therefore not distinguished from it in writing. (Or, perhaps it would be more elegant to say that the contrast between enclitic and independent word is neutralized in this environment.)

(8) The Clash is my favourite band

Well, this is (*this’s) a very neat explanation, and if you do a Google search for “this’s” that’s pretty much the explanation you’ll find given to the various other confused people who have gone to websites like English Stack Exchange to ask why this’s isn’t a word. Unfortunately, I think it can’t be right.

The problem is, there are some accents of English, including mine, which have /-əz/ rather than /-ɪz/ in the allomorph of the plural suffix that occurs after sibilants, while at the same time pronouncing unstressed is as /ɪz/ rather than /əz/. (There are minimal pairs, such as peace is upon us /ˈpiːsɪz əˈpɒn ʊz/ and pieces upon us /ˈpiːsəz əˈpɒn ʊz/.) If the enclitic form of is does occur in (8) then we’d expect it to be realized as /əz/ in these accents, just like the plural suffix would be in the same environment. This is not what happens, at least in my own accent: (8) can only have /ɪz/. Indeed, it can be distinguished from the minimally contrastive NP (9):

(9) The Clash as my favourite band

In fact this problem exists in more standard accents of English as well, because is is not the only word ending in /-z/ which can end a contraction. The third-person singular present indicative of the verb have, has, can also be contracted to -‘s, and it exhibits the expected allomorphy between voiceless and voiced realizations:

(10) it’s been a while /ɪts ˈbiːn ə ˈwajəl/

(11) somebody I used to know’s disappeared /ˈsʊmbɒdiː aj ˈjuːst tə ˈnəwz dɪsəˈpijəd/

But like is it does not contract, at least in writing, after sibilants, although it may drop the initial /h-/ whenever it’s unstressed:

(12) this has gone on long enough /ˈðɪs (h)əz gɒn ɒn lɒng əˈnʊf/

I am not a native speaker of RP, so, correct me if I’m wrong. But I would be very surprised if any native speaker of RP would ever pronounce has as /ɪz/ in sentences like (12).

What’s going on? I actually do think the answer given above—that this’s isn’t written because it sounds exactly the same as this is—is more or less correct, but it needs elaboration. Such an answer can only be accepted if we in turn accept that the plural -s, the reduced -‘s form of is and the reduced -‘s form of has do not all exhibit the same allomorph in the environment after sibilants. The reduced form of is has the allomorph /-ɪz/ in all accents, except in those such as Australian English in which unstressed /ɪ/ merges with schwa. The reduced form of has has the allomorph /-əz/ in all accents. The plural suffix has the allomorph /-ɪz/ in some accents, but /-əz/ in others, including some in which /ɪ/ is not merged completely with schwa and in particular is not merged with schwa in the unstressed pronunciation of is.

Introductory textbooks on phonology written in the English language are very fond of talking about the allomorphy of the English plural suffix. In pretty much every treatment I’ve seen, it’s assumed that /-z/ is the underlying form, and /-s/ and /-əz/ are derived by phonological rules of voicing assimilation and epenthesis respectively, with the voicing assimilation crucially coming after the epenthesis (otherwise we’d have an additional allomorph /-əs/ after voiceless sibilants, while /-əz/ would only appear after voiced sibilants). This is the best analysis when the example is taken in isolation, because positing an epenthesis rule allows the phonological rules to be assumed to be productive across the entire lexicon of English. If such a fully productive deletion rule were posited, then it would be impossible to account for the pronunciation of a word like Paulas (‘multiple people named Paula’) with /-əz/ on the surface, whose underlying form would be exactly the same, phonologically, as Pauls (‘multiple people named Paul’). (This example only works if your plural suffix post-sibilant allomorph is /-əz/ rather than /-ɪz/, but a similar example could probably be exhibited in the other case.) One could appeal to the differing placement of the morpheme boundary but this is unappealing.

However, the assumption that a single epenthesis rule operating between sibilants is productive across the entire English lexicon has to be given up, because ‘s < is and ‘s < has have different allomorphs after sibilants! Either they are accounted for by two different lexically-conditioned epenthesis rules (which is a very unappealing model) or the allomorphs with the vowels are actually the underlying ones, and the allomorphs without the vowels are produced by a not phonologically-conditioned but at least (sort of) morphologically-conditioned deletion rule that elides fully reduced unstressed vowels (/ə/, /ɪ/) before word-final obstruents. This rule only applies in inflectional suffixes (e.g. lettuce and orchid are immune), and even there it does not apply unconditionally because the superlative suffix -est is immune to it. But this doesn’t bother me too much. One can argue that the superlative is kind of a marginal inflectional category, when you put it in the company of the plural, the possessive and the past tense.

A nice thing about the synchronic rule I’m proposing here is that it’s more or less exactly the same as the diachronic rule that produced the whole situation in the first place. The Old English nom./acc. pl., gen. sg., and past endings were, respectively, -as, -es, -aþ and -ede. In Middle English final schwa was elided unconditionally in absolute word-final position, while in word-final unstressed syllables where it was followed by a single obstruent it was gradually eliminated by a process of lexical diffusion from inflectional suffix to inflectional suffix, although “a full coverage of the process in ME is still outstanding” (Minkova 2013: 231). Even the superlative suffix was reduced to /-st/ by many speakers for a time, but eventually the schwa-ful form of this suffix prevailed.

I don’t see this as a coincidence. My inclination, when it comes to phonology, is to see the historical phonology as essential for understanding the present-day phonology. Synchronic phonological alternations are for the most part caused by sound changes, and trying to understand them without reference to these old sound changes is… well, you may be able to make some progress but it seems like it’d be much easier to make progress more quickly by trying to understand the things that cause them—sound changes—at the same time. This is a pretty tentative paragraph, and I’m aware I’d need a lot more elaboration to make a convincing case for this stance. But this is where my inclination is headed.

[1] The transcription system is the one which I prefer to use for my own accent of English.


Minkova, D. 2013. A Historical Phonology of English. Edinburgh University Press.

A language with no word-initial consonants

I was having a look at some of the squibs in Linguistic Inquiry today, which are often fairly interesting (and have the redeeming quality that, when they’re not interesting, they’re at least short), and there was an especially interesting one in the April 1970 (second ever) issue by R. M. W. Dixon (Dixon 1970) which I’d like to write about for the benefit of those who can’t access it.

In Olgolo, a variety of Kunjen spoken on the Cape York Peninsula, there appears to been a sound change that elided consonants in initial position. That is, not just consonants of a particular variety, but all consonants. As a result of this change, every word in the language begins with a vowel. Examples (transcriptions in IPA):

  • *báma ‘man’ > áb͡ma
  • *míɲa ‘animal’ > íɲa
  • *gúda ‘dog’ > úda
  • *gúman ‘thigh’ > úb͡man
  • *búŋa ‘sun’ > úg͡ŋa
  • *bíːɲa ‘aunt’ > íɲa
  • *gúyu ‘fish’ > úyu
  • *yúgu ‘tree, wood’ > úgu

(Being used to the conventions of Indo-Europeanists, I’m a little disturbed by the fact that Dixon doesn’t identify the linguistic proto-variety to which the proto-forms in these examples belong, nor does he cite cognates to back up his reconstruction. But I presume forms very similar to the proto-forms are found in nearby Paman languages. In fact, I know for a fact that the Uradhi word for ‘tree’ is /yúku/ because Black (1993) mentions it by way of illustrating the remarkable Uradhi phonological rule which inserts a phonetic [k] or [ŋ] after every vowel in utterance-final position. Utterance-final /yúku/ is by this means realized as [yúkuk] in Uradhi.)

(The pre-stopped nasals in some of these words [rather interesting segments in of themselves, but fairly widely attested, see the Wikipedia article] have arisen due to a sound change occurring before the word-initial consonant elision sound change, which pre-stopped nasals immediately after word-initial syllables containing a stop or *w followed by a short vowel. This would have helped mitigate the loss of contrast resulting from the word-initial consonant elision sound change a little, but only a little, and between e.g. the words for ‘animal’ and ‘aunt’ homophony was not averted because ‘aunt’ had an originally long vowel [which was shortened in Olgolo by yet another sound change].)

Dixon says Olgolo is the only language he’s heard of in which there are no word-initial consonants, although it’s possible that more have been discovered since 1970. However, there is a caveat to this statement: there are monoconsonantal prefixes that can be optionally added to most nouns, so that they have an initial consonant on the surface. There are at least four of these prefixes, /n-/, /w-/, /y-/ and /ŋ-/; however, every noun seems to only take a single one of these prefixes, so we can regard these three forms as lexically-conditioned allomorphs of a single prefix. The conditioning is in fact more precisely semantic: roughly, y- is added to nouns denoting fish, n- is added to nouns denoting other animals, and w- is added to nouns denoting various inanimates. The prefixes therefore identify ‘noun classes’ in a sense (although these are probably not noun classes in a strict sense because Dixon gives no indication that there are any agreement phenomena which involve them). The prefix ŋ- was only seen on a one word, /ɔ́jɟɔba/ ~ /ŋɔ́jɟɔba/ ‘wild yam’ and might be added to all nouns denoting fruits and vegetables, given that most Australian languages with noun classes have a noun class for fruits and vegetables, but there were no other such nouns in the dataset (Dixon only noticed the semantic conditioning after he left the field, so he didn’t have a chance to elicit any others). It must be emphasized, however, that these prefixes are entirely optional, and every noun which can have a prefix added to it can also be pronounced without the prefix. In addition some nouns, those denoting kin and body parts, appear to never take a prefix, although possibly this is just a limitation of the dataset given that their taking a prefix would be expected to be optional in any case. And words other than nouns, such as verbs, don’t take these prefixes at all.

Dixon hypothesizes that the y- and n- prefixes are reduced forms of /úyu/ ‘fish’ and /íɲa/ ‘animal’ respectively, while w- may be from /úgu/ ‘tree, wood’ or just an “unmarked” initial consonant (it’s not clear what Dixon means by this). These derivations are not unquestionable (for example, how do we get from /-ɲ-/ to /n-/ in the ‘animal’ prefix?) But it’s very plausible that the prefixes do originate in this way, even if the exact antedecent words are difficult to identify, because similar origins have been identified for noun class prefixes in other Australian languages (Dixon 1968, as cited by Dixon 1970). Just intuitively, it’s easy to see how nouns might come to be ever more frequently replaced by compounds of the dependent original noun and a term denoting a superset; cf. English koala ~ koala bear, oak ~ oak tree, gem ~ gemstone. In English these compounds are head-final but in other languages (e.g. Welsh) they are often head-initial, and presumably this would have to be the case in pre-Olgolo in order for the head elements to grammaticalize into noun class prefixes. The fact that the noun class prefixes are optional certainly suggests that the system is very much incipient, and still developing, and therefore of recent origin.

It might therefore be very interesting to see how the Olgolo language has changed after a century or so; we might be able to examine a noun class system as it develops in real time, with all of our modern equipment and techniques available to record each stage. It would also be very interesting to see how quickly this supposedly anomalous state of every word beginning with a vowel (in at least one of its freely-variant forms) is eliminated, especially since work on Australian language phonology since 1970 has established many other surprising findings about Australian syllable structure, including a language where the “basic’ syllable type appears to be VC rather than CV (Breen & Pensalfini 1999). Indeed, since Dixon wrote this paper 46 years ago Olgolo might have changed considerably already. Unfortunately, it might have changed in a somewhat more disappointing way. None of the citations of Dixon’s paper recorded by Google Scholar seem to examine Olgolo any further, and the documentation on Kunjen (the variety which includes Olgolo as a subvariety) recorded in the Australian Indigenous Languages Database isn’t particularly overwhelming. I can’t find a straight answer as to whether Kunjen is extinct today or not (never mind the Olgolo variety), but Dixon wasn’t optimistic about its future in 1970:

It would be instructive to study the development of Olgolo over the next few generations … Unfortunately, the language is at present spoken by only a handful of old people, and is bound to become extinct in the next decade or so.


Black, P. 1993 (post-print). Unusual syllable structure in the Kurtjar language of Australia. Retrieved from http://espace.cdu.edu.au/view/cdu:42522 on 26 September 2016.

Breen, G. & Pensalfini, R. 1999. Arrernte: A Language with No Syllable Onsets. Linguistic Inquiry 30 (1): 1-25.

Dixon, R. M. W. 1968. Noun Classes. Lingua 21: 104-125.

Dixon, R. M. W. 1970. Olgolo Syllable Structure and What They Are Doing about It. Linguistic Inquiry 1 (2): 273-276.

The insecurity of relative chronologies

One of the things historical linguists do is reconstruct relative chronologies: statements about whether one change in a language occurred before another change in the language. For example, in the history of English there was a change which raised the Middle English (ME) mid back vowel /oː/, so that it became high /uː/: boot, pronounced /boːt/ in Middle English, is now pronounced /buːt/. There was also a change which caused ME /oː/ to be reflected as short /ʊ/ before /k/ (among other consonants), so that book is now pronounced as /bʊk/. There are two possible relative chronologies of these changes: either the first happens before the second, or the second happens before the first. Now, because English has been well-recorded in writing for centuries, because these written records of the language often contain phonetic spellings, and because they also sometimes communicate observations about the language’s phonetics, we can date these changes quite precisely. The first probably began in the thirteenth century and continued through the fourteenth, while the second took place in the seventeenth century (Minkova 2015: 253-4, 272). In this particular case, then, no linguistic reasoning is needed to infer the relative chronology. But much of if not most of the time in historical linguistics, we are not so lucky, and are dealing with the history of languages for which written records in the desired time period are much less extensive, or completely nonexistent. Relative chronologies can still be inferred under these circumstances; however, it is a methodologically trickier business. In this post, I want to point out some complications associated with inferring relative chronologies under these circumstances which I’m not sure historical linguists are always aware of.

Let’s begin by thinking again about the English example I gave above. If English was an unwritten language, could we still infer that the /oː/ > /uː/ change happened before the /oː/ > /ʊ/ change? (I’m stating these changes as correspondences between Middle English and Modern English sounds—obviously if /oː/ > /uː/ happened first then the second change would operate on /uː/ rather than /oː/.) A first answer might go something along these lines: if the /oː/ > /uː/ change in quality happens first, then the second change is /uː/ > /ʊ/, so it’s one of quantity only (long to short). On the other hand, if /oː/ > /ʊ/ happens first we have a shift of both quantity and quality at the same time, followed by a second shift of quality. The first scenario is simpler, and therefore more likely.

Admittedly, it’s only somewhat more likely than the other scenario. It’s not absolutely proven to be the correct one. Of course we never have truly absolute proofs of anything, but I think there’s a good order of magnitude or so of difference between the likelihood of /oː/ > /uː/ happening first, if we ignore the evidence of the written records and accept this argument, and the likelihood of /oː/ > /uː/ happening first once we consider the evidence of the written records.

But in fact we can’t even say it’s more likely, because the argument is flawed! The /uː/ > /ʊ/ would involve some quality adjustment, because /ʊ/ is a little lower and more central than /uː/.[1] Now, in modern European languages, at least, it is very common for minor quality differences to exist between long and short vowels, and for lengthening and shortening changes to involve the expected minor shifts in quality as well (if you like, you can think of persistent rules existing along the lines of /u/ > /ʊ/ and /ʊː/ > /uː/, which are automatically applied after any lengthening or shortening rules to “adjust” their outputs). We might therefore say that this isn’t really a substantive quality shift; it’s just a minor adjustment concomitant with the quality shift. But sometimes, these quality adjustments following lengthening and shortening changes go in the opposite direction than might be expected based on etymology. For example, when /ʊ/ was affected by open syllable lengthening in Middle English, it became /oː/, not /uː/: OE wudu > ME wood /woːd/. This is not unexpected, because the quality difference between /uː/ and /ʊ/ is (or, more accurately, can be) such that /ʊ/ is about as close in quality to /oː/ as it is to /uː/. Given that /ʊ/ could lengthen into /oː/ in Middle English, it is hardly unbelievable that /oː/ could shorten into /ʊ/ as well.

I’m not trying to say that one should go the other way here, and conclude that /oː/ > /ʊ/ happened first. I’m just trying to argue that without the evidence of the written records, no relative chronological inference can be made here—not even an insecure-but-best-guess kind of relative chronological inference. To me this is surprising and somewhat disturbing, because when I first started thinking about it I was convinced that there were good intrinsic linguistic reasons for taking the /oː/ > /uː/-first scenario as the correct one. And this is something that happens with a lot of relative chronologies, once I start thinking about them properly.

Let’s now go to an example where there really is no written evidence to help us, and where my questioning of the general relative-chronological assumption might have real force. In Greek, the following two very well-known generalizations about the reflexes of Proto-Indo-European (PIE) forms can be made:

  1. The PIE voiced aspirated stops are reflected in Greek as voiceless aspirated stops in the general environment: PIE *bʰéroh2 ‘I bear’ > Greek φέρω, PIE *dʰéh₁tis ‘act of putting’ > Greek θέσις ‘placement’, PIE *ǵʰáns ‘goose’ > Greek χήν.
  2. However, in the specific environment before another PIE voiced aspirated stop in the onset of the immediately succeeding syllable, they are reflected as voiceless unaspirated stops: PIE *bʰeydʰoh2 ‘I trust’ > Greek πείθω ‘I convince’, PIE *dʰédʰeh1mi ‘I put’ > Greek τίθημι. This is known as Grassman’s Law. PIE *s (which usually became /h/ elsewhere) is elided in the same environment: PIE *segʰoh2 ‘I hold’ > Greek ἔχω ‘I have’ (note the smooth breathing diacritic).

On the face of it, the fact that Grassman’s Law produces voiceless unaspirated stops rather than voiced ones seems to indicate that it came into effect only after the sound change that devoiced the PIE voiced aspirated stops. For otherwise, the deaspiration of these voiced aspirated stops due to Grassman’s Law would have produced voiced unaspirated stops at first, and voiced unaspirated stops inherited from PIE, as in PIE *déḱm̥ ‘ten’ > Greek δέκα, were not devoiced.

However, if we think more closely about the phonetics of the segments involved, this is not quite as obvious. The PIE voiced aspirated stops could surely be more accurately described as breathy-voiced stops, like their presumed unaltered reflexes in modern Indo-Aryan languages. Breathy voice is essentially a kind of voice which is closer to voicelessness than voice normally is: the glottis is more open (or less tightly closed, or open at one part and not at another part) than it is when a modally voiced sound is articulated. Therefore it does not seem out of the question for breathy-voiced stops to deaspirate to voiceless stops if they are going to be deaspirated, in a similar manner as ME /ʊ/ becoming /oː/ when it lengthens. Granted, I don’t know of any attested parallels for such a shift. And in Sanskrit, in which a version of Grassman’s Law also applies, breathy-voiced stops certainly deaspirate to voiced stops: PIE *dʰédʰeh1mi ‘I put’ > Sanskrit dádhāmi. So the Grassman’s Law in Greek certainly has to be different in nature (and probably an entirely separate innovation) from the Grassman’s Law in Sanskrit.[2]

Another example of a commonly-accepted relative chronology which I think is highly questionable is the idea that Grimm’s Law comes into effect in Proto-Germanic before Verner’s Law does. To be honest, I’m not really sure what the rationale is for thinking this in the first place. Ringe (2006: 93) simply asserts that “Verner’s Law must have followed Grimm’s Law, since it operated on the outputs of Grimm’s Law”. This is unilluminating: certainly Verner’s Law only operates on voiceless fricatives in Ringe’s formulation of it, but Ringe does not justify his formulation of Verner’s Law as applying only to voiceless fricatives. In general, sound changes will appear to have operated on the outputs of a previous sound change if one assumes in the first place that the previous sound change comes first: the key to justifying the relative chronology properly is to think about what alternative formulations of each sound change are required in order to make the alternative chronology (such alternative formulations can almost always be formulated), and establish the high relative unnaturalness of the sound changes thus formulated compared to the sound changes as formulable under the relative chronology which one wishes to justify.

If the PIE voiceless stops at some point became aspirated (which seems very likely, given that fricativization of voiceless stops normally follows aspiration, and given that stops immediately after obstruents, in precisely the same environment that voiceless stops are unaspirated in modern Germanic languages, are not fricativized), then Verner’s Law, formulated as voicing of obstruents in the usual environments, followed by Grimm’s Law formulated in the usual manner, accounts perfectly well for the data. A Wikipedia editor objects, or at least raises the objection, that a formulation of the sound change so that it affects the voiceless fricatives, specifically, rather than the voiceless obstruents as a whole, would be preferable—but why? What matters is the naturalness of the sound change—how likely it is to happen in a language similar to the one under consideration—not the sizes of the categories in phonetic space that it refers to. Some categories are natural, some are unnatural, and this is not well correlated with size. Both fricatives and obstruents are, as far as I am aware, about equally natural categories.

I do have one misgiving with the Verner’s Law-first scenario, which is that I’m not aware of any attested sound changes involving intervocalic voicing of aspirated stops. Perhaps voiceless aspirated stops voice less easily than voiceless unaspirated stops. But Verner’s Law is not just intervocalic voicing, of course: it also interacts with the accent (precisely, it voices obstruents only after unaccented syllables). If one thinks of it as a matter of the association of voice with low tone, rather than of lenition, then voicing of aspirated stops might be a more believable possibility.

My point here is not so much about the specific examples; I am not aiming to actually convince people to abandon the specific relative chronologies questioned here (there are likely to be points I haven’t thought of). My point is to raise these questions in order to show at what level the justification of the relative chronology needs to be done. I expect that it is deeper than many people would think. It is also somewhat unsettling that it relies so much on theoretical assumptions about what kinds of sound changes are natural, which are often not well-established.

Are there any relative chronologies which are very secure? Well, there is another famous Indo-European sound law associated with a specific relative chronology which I think is secure. This is the “law of the palatals” in Sanskrit. In Sanskrit, PIE *e, *a and *o merge as a; but PIE *k/*g/*gʰ and *kʷ/*gʷ/*gʷʰ are reflected as c/j/h before PIE *e (and *i), and k/g/gh before PIE *a and *o (and *u). The only credible explanation for this, as far as I can see, is that an earlier sound change palatalizes the dorsal stops before *e and *i, and then a later sound change merges *e with *a and *o. If *e had already merged with *a and *o by the time the palatalization occurred, then the palatalization would have to occur before *a, and it would have to be sporadic: and sporadic changes are rare, but not impossible (this is the Neogrammarian hypothesis, in its watered-down form). But what really clinches it is this: that sporadic change would have to apply to dorsal stops before a set of instances of *a which just happened to be exactly the same as the set of instances of *a which reflect PIE *e, rather than *a or *o. This is astronomically unlikely, and one doesn’t need any theoretical assumptions to see this.[3]

Now the question I really want to answer here is: what exactly are the relevant differences in this relative chronology that distinguish it from the three more questionable ones I examined above, and allow us to infer it with high confidence (based on the unlikelihood of a sporadic change happening to appear conditioned by an eliminated contrast)? It’s not clear to me what they are. Something to do with how the vowel merger counterbleeds the palatalization? (I hope this is the correct relation. The concepts of (counter)bleeding and (counter)feeding are very confusing for me.) But I don’t think this is referring to the relevant things. Whether two phonological rules / sound changes (counter)bleed or (counter)feed each other is a function of the natures of the phonological rules / sound changes; but when we’re trying to establish relative chronologies we don’t know what the natures of the phonological rules / sound changes are! That has to wait until we’ve established the relative chronologies. I think that’s why I keep failing to compute whether there is also a counterbleeding in the other relative chronologies I talked about above: the question is non-well-formed. (In case you can’t tell, I’m starting to mostly think aloud in this paragraph.) What we do actually know are the correspondences between the mother language and the daughter language[4], so an answer to the question should state it in terms of those correspondences. Anyway, I think it is best to leave it here, for my readers to read and perhaps comment with their ideas, providing I’ve managed to communicate the question properly; I might make another post on this theme sometime if I manage to work out (or read) an answer that satisfies me.

Oh, but one last thing: is establishing the security of relative chronologies that important? I think it is quite important. For a start, relative chronological assumptions bear directly on assumptions about the natures of particular sound changes, and that means they affect our judgements of which types of sound changes are likely and which are not, which are of fundamental importance in historical phonology and perhaps of considerable importance in non-historical phonology as well (under e.g. the Evolutionary Phonology framework of Blevins 2004).[5] But perhaps even more importantly, they are important in establishing genetic linguistic relationships. Ringe & Eska (2014) emphasize in their chapter on subgrouping how much less likely it is for languages to share the same sequence of changes than the same unordered set of changes, and so how the establishment of secure relative chronologies is our saving grace when it comes to establishing subgroups in cases of quick diversification (where there might be only a few innovations common to a given subgroup). This seems reasonable, but if the relative chronologies are insecure and questionable, we have a problem (and the sequence of changes they cite as establishing the validity of the Germanic subgroup certainly contains some questionable relative chronologies—for example they have all three parts of Grimm’s Law in succession before Verner’s Law, but as explained above, Verner’s Law could have come before Grimm’s; the third part of Grimm’s Law may also have not happened separately from the first).

[1] This quality difference exists in present-day English for sure—modulo secondary quality shifts which have affected these vowels in some accents—and it can be extrapolated back into seventeenth-century English with reasonable certainty using the written records. If we are ignoring the evidence of the written records, we can postulate that the quality differentiation between long /uː/ and short /ʊ/ was even more recent than the /uː/ > /ʊ/ shift (which would now be better described as an /uː/ > /u/ shift). But the point is that such quality adjustment can happen, as explained in the rest of the paragraph.

[2] There is a lot of literature on Grassman’s Law, a lot of it dealing with relative chronological issues and, in particular, the question of whether Grassman’s Law can be considered a phonological rule that was already present in PIE. I have no idea why one would want to—there are certainly PIE forms inherited in Germanic that appear to have been unaffected by Grassman’s Law, as in PIE *bʰeydʰ- > English bide; but I’ve hardly read any of this literature. My contention here is only that the generally-accepted relative chronology of Grassman’s Law and the devoicing of the PIE voiced aspirated stops can be contested.

[3] One should bear in mind some subtleties though—for example, *e and *a might have gotten very, very phonetically similar, so that they were almost merged, before the palatalization occured. If one wants to rule out that scenario, one has to appeal again to the naturalness of the hypothesized sound changes. But as long as we are talking about the full merger of *e and *a we can confidently say that it occurred after palatalization.)

[4] Actually, in practice we don’t know these with certainty either, and the correspondences we postulate to some extent are influenced by our postulations about the natures of sound changes that have occurred and their relative chronologies… but I’ve been assuming they can be established more or less independently throughout these posts, and that seems a reasonable assumption most of the time.

[5] I realize I’ve been talking about phonological changes throughout this post, but obviously there are other kinds of linguistic changes, and relative chronologies of those changes can be established too. How far the discussion in this post applies outside of the phonological domain I will leave for you to think about.


Blevins, J. 2004. Evolutionary phonology: The emergence of sound patterns. Cambridge University Press.

Minkova, D. 2013. A historical phonology of English. Edinburgh University Press.

Ringe, D. 2006. A linguistic history of English: from Proto-Indo-European to Proto-Germanic. Oxford University Press.

Ringe, D. & Eska, J. F. 2013. Historical linguistics: toward a twenty-first century reintegration. Cambridge University Press.