The difference between key-value containers and functions

I was wondering why all of the programming languages I know treat key-value containers (i.e. sequences and associative arrays) differently from functions. From mathematics, I’m used to thinking of key-value containers as functions of one argument, taking a key as input and giving a value as output. Treating key-value containers as functions would have some nice effects, such as allowing the free use of infinite containers and allowing the use of function composition in place of the higher-order function normally called map.

Here’s my guess: one major difference between key-value containers and functions in most programming languages is that the values in containers are precomputed, while the values of functions are computed only when the function is called on the argument. This can be illustrated if we try using function composition in place of map in Racket:

(define (my-map f xs)
  (lambda (i)
    (f (xs i))))

(define (my-sequence i)
  (match i
    (0 1)
    (1 4)
    (2 9)))

(define (write-and-square x)
  (writeln x)
  (* x x))

(writeln "squared-list:")

; displays 1, 4 and 9
(define squared-list (map write-and-square (list 1 4 9)))

(writeln "my-sequence-squared created:")

; nothing displayed
(define my-sequence-squared (my-map write-and-square my-sequence))

(writeln "my-sequence-squared accessed:")

(writeln (my-sequence-squared 0)) ; displays 1 and 1
(writeln (my-sequence-squared 1)) ; displays 4 and 16
(writeln (my-sequence-squared 2)) ; displays 9 and 81

When we use Racket’s map function to define squared-list, the members of the list have write-and-square applied right away, and we see the side-effect of the function. But when we use our my-map function, the evaluation of write-and-square on a given key is delayed until the value the key maps to is actually requested. This is a sort of lazy evaluation. Although lazy evaluation has its uses, it can lead to inefficiency if a key is accessed repeatedly and the value it maps to is fully recomputed again and again needlessly without a cache being stored in memory.

To deal with this, we can use memoization. However, since an arbitrary function may have an infinite domain, we have to choose which values to memoize. Here’s a Racket function for carrying out the memoization (using some eval black magic to dynamically write a pattern match):

(define (memoize mapping keys)
  (define cases (map (lambda (key)
    (list key (mapping key))) keys))

  (lambda (key)
    (eval (append
      (list 'match key)
      (list (list '_ (list mapping key)))))))

And here’s a test showing the function has the expected behaviour.

(writeln "my-sequence-squared-2 created:")

; displays nothing
(define my-sequence-squared-2 (my-map write-and-square my-sequence))

(writeln "my-sequence-squared-3 created:")

; displays 1 and 4
(define my-sequence-squared-3 (memoize my-sequence-squared-2 (list 0 1)))

(writeln "my-sequence-squared-3 accessed:")

(my-sequence-squared-3 1) ; displays 16
(my-sequence-squared-3 2) ; displays 9 and 81

Now, for functions with finite domain, a natural default is to memoize all of the values. But how do we do that? Given an arbitrary function, we don’t know whether its domain is finite, and even if its domain is finite, we don’t know how to iterate through all of the domain’s members. So we can’t just add some extra code in my-map saying something like “if the function has a finite domain, memoize each of the values of the composition”. Our options are to either require users of my-map to explicitly memoize whatever specific values need memoizing, or add some extra information to the functions themselves specifying whether they have a finite domain and how they can be iterated over, which my-map can then take advantage of.

A natural way to supply this information is to use objects of a distinct type, a “container type”, instead of functions on finite domains. Any object with a container type can be assumed to be finite. Furthermore, we can have multiple container types, each associated with their own iteration protocol.

This, then, is the purpose, or at least a purpose, of having containers as a distinct concept from functions. Containers are functions together with the necessary additional information required to memoize all of their values behind the scenes whenever they are composed.

Source code for this blog post (note: the last couple of lines give an error when the file is run in DrRacket, but if you type them into the REPL after running the rest of the program they work fine, so I assume the problem is DrRacket’s and not mine 🙂 )


Reversing a linked list in place

The other day I finally realized how you reverse a linked list in place.

I had attempted to tackle this problem several times before, and had always ended up looking up the answer online, copying and pasting some code given in the answer and seeing that it worked, but not really understanding how it worked, as evidenced by my failing to remember how to do it the next time. Now, however, I’m confident that I can remember it for the long term.

It’s remarkable how unhelpful the online resources were here. Consider this description of the algorithm from, which was the first result Google gave me for “reversing a linked list” as I was writing this post:

  1. Initialize three pointers prev as NULL, curr as head and next as NULL.

  2. Iterate trough the linked list. In loop, do following.
    // Before changing next of current,
    // store next node
    next = curr->next
    // Now change next of current
    // This is where actual reversing happens
    curr->next = prev
    // Move prev and curr one step forward
    prev = curr
    curr = next

This is a particularly bad explanation—it’s not actually complete, and even the spelling and grammar is poor—but in terms of how it describes the solution it’s not substantially different from the others I saw. It just says you have to use three pointers and do a particular sequence of reassignments until one of them is null. There is little explanation as to why this particular sequence of reassignments works. For the reader, it’s just a magic sequence that has to be memorized.

In order to understand anything complex, you have to use abstractions, to break the complexity into manageable chunks, and analogies, to allow you to apply your existing knowledge about other things. For me, the key to understanding how to reverse a linked list was realizing that what you do to the list in the loop above can be understood as simply carrying out the two operations below in succession:

  1. Remove the first item from the list.

  2. Attach this item to the front of a new list.

The processes of attachment and removal can be readily visualised, so they allow us to draw on our spatial intuition to reason about data structures. It’s obvious that if we keep removing items from the original list and attaching them to the new list until there are no items left to remove from the original list, then the new linked list will end up having the same items the original list originally had, but in reverse order. The new list can then be assigned to the variable holding the original list.

It’s really an incredibly straightforward algorithm. I think the reason it took me so long to figure it out was that I was disregarding all ideas for solutions that on a conceptual level involved moving items into a new list and then reassigning the variable. This was a case where spatial intuition was leading me astray: such solutions are conceptually not “in-place”, so I assumed any such solution would have be actually not in-place; an amount of space proportional to the size of the list would need to be used in order to store the new list. Of course, I was forgetting that you can remove items from the original list as you attach them to the new list, so that no additional space needs to be used that wasn’t already being used to store the original list.

Here’s the code I ended up writing (in C):

#include <stdlib.h>

struct node {
    int value;
    struct node *next;

void list_push(struct node **list, struct node *node) {
    node->next = *list;
    *list = node;

struct node *list_pop(struct node **list) {
    struct node *node = *list;
    *list = node->next;
    return node;

void list_reverse(struct node **list) {
    struct node *new_list = NULL;
    while (*list)
        list_push(&new_list, list_pop(list));
    *list = new_list;

Note that if you inline the list_push and list_pop functions in list_reverse, you get what is essentially the loop with three pointers that all the online resources were talking about:

void list_reverse(struct list **list) {
    struct list *new_list = NULL;
    while (*list) {
        struct list *node = *list;
        *list = node->next;
        node->next = new_list;
        new_list = node;
    *list = new_list;

Only the variable names are different: list, node and new_list correspond to the variables next, curr and prev respectively in the description.

There are of course different ways of conceptualizing what happens during the loop. The conceptualization described above, where we move the items into a new list by attaching to the front, is just the one that came most naturally to me. From the variable names the writer of that article used, it’s evident that their conceptualization was different—prev doesn’t make a lot of sense as a name for a new container we’re placing items into.

Now that I look at it again, I see that the article includes an animation which illustrates the conceptualization they were probably using. You can think of the linked list as a bunch of items connected by arrows, where the arrows represent pointers. To reverse the list, you just flip the arrows!

Once you conceptualize the problem in this way, I think it’s fairly straightforward to work out how to do it. You have to iterate over the list, keeping both the current node and the previous node in memory so that you can point the current node to the previous one. The previous one either will have been already set to point to one before it, or is the first node in the list, in which case it needs to be pointed to NULL. We’ll also need make temporary copies of what the current node was originally pointing to on each iteration, so we can move on to the next node even after we change what the current node is pointing to. So with this conceptualization, we can easily see that we need three pointers, and that prev, curr and next are appropriate names for them.

void list_reverse(struct list **list) {
    struct list *prev = NULL;
    struct list *curr = *list;
    while (curr) {
        struct list *next = curr->next;
        curr->next = prev;
        prev = curr;
        curr = next;

This makes it more clear where the article was coming from. To its writer, the sequence of assignments in the loop probably seemed like a simple atomic operation—“make the current item point to the previous one”. But the writer failed to explicitly point out that they were making use of this conceptualization, and the clarity of their explanation for a beginner suffered because of it.

It’s likely that they weren’t particularly aware that they were relying on this particular conceptualization, or that there were alternatives. I think this is a general problem with trying to explain things to other people. People tend to stick with the first conceptualization of a problem that proves helpful for them, and think only in those terms. Even if they are aware of alternative conceptualizations, they may think of those conceptualizations as misunderstandings of the problem, when it’s really just a case of different conceptualizations having their own advantages and disadvantages and suiting different thinking styles. Even if some conceptualizations are less helpful than others, it may be best to allow a beginner to work with the conceptualization they find most natural as a starting point.

Anyway, all this has got me curious—if there is anybody still reading who has ever tried to reverse a linked list, how do you prefer to conceptualize the linked list reversal problem? Moving items, or flipping pointers, or something else? Leave a comment—I’d be interested to hear your answer.

Voles and Orkney

What do voles and Orkney have to do with one another? One thing somebody knowledgeable about British wildlife might be able to tell you is that Orkney is home to a unique variety of the common European vole (Microtus arvalis) called the Orkney vole.

The most remarkable thing about the Orkney vole is that the common European vole isn’t found anywhere else in the British Isles, nor in Scandinavia—it’s a continental European animal. That raises the question of how a population of them ended up in Orkney. During the last ice age, Orkney was covered by a glacier and would have been uninhabitable by voles; and after the ice retreated, Orkney was separated from Great Britain straight away; there were never any land bridges that would have allowed voles from Great Britain to colonize Orkney. Besides, there is no evidence that M. arvalis was ever present on Great Britain, nor is there any evidence that voles other than M. arvalis were ever present on Orkney; none of the three species that inhabit Great Britain today (the field vole, Microtus agrestis, the bank vole, Myodes glareolus, and the water vole, Arvicola amphibius) were able to colonize Orkney, even though they were able to colonize some islands that were originally connected to Great Britain by land bridges (Haynes, Jaarola & Searle, 2003). The only plausible hypothesis is that the Orkney voles were introduced into Orkney by humans.

But if the Orkney voles were introduced, they were introduced at a very early date—the earliest discovered Orkney vole remains have been carbon-dated to ca. 3100 BC (Martínkova et al., 2013)—around the same time Skara Brae was first occupied, to put that in context. The only other mammals on the British Isles known to have been introduced at a similarly ancient date or earlier are the domestic dog and the domestic bovines (cattle, sheep, goats)—even the house mouse is not known to have been present before c. 500 BC (Montgomery, 2014)! The motivation for the introduction remains mysterious—voles might have been transported accidentally in livestock fodder imported from the Continent, or they might have been deliberately introduced as pets, food sources, etc.; we can only speculate. It’s interesting to note that the people of Orkney at this time seem to have been rather influential, as they introduced the Grooved Ware pottery style to other parts of the British Isles.

Anyway, there is in fact another interesting connection between voles and Orkney, which has to do with the word ‘vole’ itself. Something you might be aware of if you’ve looked at old books on British wildlife is that ‘vole’ is kind of a neologism. Traditionally, voles were not thought of as a different sort of animal from mice and rats. The relatively large animal we usually call the water vole today, Arvicola amphibius, was called the ‘water rat’ (as it still is sometimes today), or less commonly the ‘water mouse’. The smaller field vole, Microtus agrestis, was often just the ‘field mouse’, not distinguished from Apodemus sylvaticus, although it was sometimes distinguished as the ‘water mouse’ or the ‘short-tailed field mouse’ (as opposed to the ‘long-tailed field mouse’ A. sylvaticus—if you’ve ever wondered why people still call A. sylvaticus the ‘long-tailed field mouse’, even though its tail isn’t much longer than that of other British mice, that’s probably why!) The bank vole, Myodes glareolus, seems not to have been distinguished from the field vole before 1832 (the two species are similar in appearance, one distinction being that whereas the bank vole’s tail is about half its body length, the field vole’s tail is about 30% to 40% of its body length).

As an example, a reference to a species of vole as a ‘mouse’ can be found in the 1910 edition of the Encyclopedia Britannica:

The snow-mouse (Arvicola nivalis) is confined to the alpine and snow regions. (vol. 1, p. 754, under “Alps”)

Today that would be ‘the snow vole (Chionomys nivalis)’.

A number of other small British mammals were traditionally subsumed under the ‘mouse’ category, namely:

  • Shrews, which were often referred to as shrewmice from the 16th to the 19th centuries, although ‘shrew’ on its own is the older word (it is attested in Old English, but its ultimate origin is unknown).
  • Bats, which in older language could also be referred to by a number of whimsical compound words, the oldest and most common being rearmouse, from a now-obsolete verb meaning ‘stir’, but also rattlemouse, flindermouse, flickermouse, flittermouse and fluttermouse. The word rearmouse is still used today in the strange language of heraldry.
  • And, of course, dormice, which are still referred to by a compound ending in ‘-mouse’, although we generally don’t think of them as true mice today. The origin of the ‘dor-‘ prefix is uncertain; the word is attested first in c. 1425. There was an Old English word sisemūs for ‘dormouse’ whose origins are similarly mysterious, but the -mūs element is clearly ‘mouse’.

There is still some indeterminacy about the boundaries of the ‘mouse’ category when non-British rodent species are included: for example, are birch mice mice?

So, where did the word ‘vole’ come from? Well, according to the OED, it was first used in a book called History of the Orkney Islands (available from, published in 1805 and written by one George Barry, who was not a native of Orkney but a minister who preached there. In a list of the animals that inhabit Orkney, we find the following entry (alongside entries for the Shrew Mouse ſorex araneus, the [unqualified] Mouse mus muſculus, and the [unqualified] Field Mouse mus sylvaticus):

The Short-tailed Field Mouse, (mus agreſtis, Lin. Syſt.) which with us has the name of the vole mouſe, is very often found in marſhy grounds that are covered with moſs and ſhort heath, in which it makes roads or tracks of about three inches in breadth, and ſometimes miles in length, much worn by continual treading, and warped into a thouſand different directions. (p. 320)

So George Barry knew vole mouse as the local, Orkney dialectal word for the Orkney vole, which he was used to calling a ‘short-tailed field mouse’ (evidently he wasn’t aware that the Orkney voles were actually of a different species from the Scottish M. agrestis—I don’t know when the Orkney voles’ distinctiveness was first identified). Now, given that vole mouse was an Orkney dialect word, its further etymology is straightforward: the vole element is from Old Norse vǫllr ‘field’ (cf. English wold, German Wald ‘forest’), via the Norse dialect once spoken in Orkney and Shetland (sometimes known as ‘Norn’). So the Norse, like the English, thought of voles as ‘field mice’. The word vole is therefore the only English word I know, that isn’t about something particularly to do with Orkney or Shetland, that has been borrowed from Norn.

Of course, Barry only introduced vole mouse as a Orcadianism; he wasn’t proposing that the word be used to replace ‘short-tailed field mouse’. The person responsible for that seems to have been the author of the next quotation in the OED, from an 1828 book titled A History of British Animals by University of Edinburgh graduate John Fleming (available from On p. 23, under an entry for the genus Arvicola, Fleming notes that

The species of this genus differ from the true mice, with which the older authors confounded them, by the superior size of the head, the shortness of the tail, and the coarseness of the fur.

He doesn’t explain where he got the name vole from, nor does he seem to reference Barry’s work at all, but he does list alternative common names of each of the two vole species he identifies. The species Arvicola aquatica, which he names the ‘Water Vole’ for the first time, is noted to also be called the ‘Water Rat’, ‘Llygoden y dwfr’ (in Welsh) or ‘Radan uisque’ (in Scottish Gaelic). The species Arvicola agrestis, which he names the ‘Field Vole’ for the first time, is noted to be also called the ‘Short-tailed mouse’, ‘Llygoden gwlla’r maes’ (in Welsh), or “Vole-mouse in Orkney”.

Fleming also separated the shrews, bats and dormice from the true mice, thus establishing division of the British mammals into basic one-word-labelled categories that we are familiar with today. With respect to the other British mammals, the naturalists seem to have found the traditional names to be sufficiently precise: for example, each of the three quite similar species of the genus Mustela has its own name—M. erminea being the stoat, M. nivalis being the weasel, and M. putorius being the polecat.

Fleming still didn’t distinguish the field vole and the bank vole; that innovation was made by one Mr. Yarrell in 1832, who exhibited specimens of each to the Zoological Society, demonstrated their distinctiveness and gave the ‘bank vole’ (his coinage) the Latin name Arvicola riparia. It was later found that the British bank vole was the same species as a German one described by von Schreber in 1780 as Clethrionomys glareolus, and so that name took priority (and just recently, during the 2010s, the name Myodes has come to be favoured for the genus over Clethrionomys—I don’t know why exactly).

In the report of Yarrell’s presentation in the Proceedings of the Zoological Society the animals are referred to as the ‘field Campagnol‘ and ‘bank Campagnol‘, so the French borrowing campagnol (‘thing of the field’, still the current French word for ‘vole’) seems to have been favoured by some during the 19th century, although Fleming’s recognition of voles as distinct from mice was universally accepted. The word ‘vole’ was used by other authors such as Thomas Bell in A History of British Quadrupeds including the Cetacea (1837), and eventually the Orcadian word seems to have prevailed and entered ordinary as well as naturalists’ usage.


Haynes, S., Jaarola, M., & Searle, J. B. (2003). Phylogeography of the common vole (Microtus arvalis) with particular emphasis on the colonization of the Orkney archipelago. Molecular Ecology, 12, 951–956.

Martínkova, N., Barnett, R., Cucchi, T., Struchen, R., Pascal, M., Pascal, M., Fischer, M. C., Higham, T., Brace, S., Ho, S. Y. W., Quéré, J., O’Higgins, P., Excoffier, L., Heckel, G., Rus Hoelzel, A., Dobney, K. M., & Searle, J. B. (2013). Divergent evolutionary processes associated with colonization of offshore islands. Molecular Ecology, 22, 5205–5220.

Montgomery, W. I., Provan, J., Marshal McCabe, A., & Yalden, D. W. (2014). Origin of British and Irish mammals: disparate post-glacial colonisation and species introductions. Quaternary Science Reviews, 98, 144–165.

Truth-uncertainty and meaning-uncertainty

Epistemic status: just a half-baked idea, which ought to be developed into something more complete, but since I’m probably not going to do that anytime soon I figured I’d publish it now just to get it out there.

Consider a statement such as (1) below.

(1) Cats are animals.

I’m used to interpreting statements such as (1) using a certain method which I’m going to call the “truth-functional method”. Its key characteristic is, as suggested by the name, that statements are supposed to be interpreted as truth functions, so that a hypothetical being which knew everything (had perfect information) would be able to assign a truth value—true or false—to every statement. There are two problems which prevent truth values being assigned straightforwardly to statements in practice.

The first is that nobody has perfect information. There is always some uncertainty of the sort which I’m going to call “truth-uncertainty”. Therefore, it’s often (or maybe even always) impossible to determine a statement’s truth value exactly. All one can do is have a “degree of belief” in the statement, though this degree of belief may be meaningfully said to be “close to truth” or “close to falsth1” or equally far from both. People disagree about how exactly degrees of belief should be thought about, but there’s a very influential school of thought (the Bayesian school of thought) which holds that degrees of belief are best thought about as probabilities, obeying the laws of probability theory. So, for a given statement and a given amount of available information, the goal for somebody practising the truth-functional method is to assign a degree of belief to the statement. At least inside the Bayesian school, there has been a lot of thought about how this process should work, so that truth-uncertainty is the relatively well-understood sort of uncertainty.

But there’s a second problem, which is that often (maybe even always) it’s unclear exactly what the statement means. To be more exact (the preceding sentence was an exemplification of itself), when you hear a statement, it’s often unclear exactly which truth function the statement is supposed to be interpreted as; and depended on which truth function it’s interpreted as, the degree of belief you assign to it will be different. This is the problem of meaning-uncertainty, and it seems to be rather less well-understood. Indeed, it’s probably not conventional to think about it as an uncertainty problem at all in the same way as truth-uncertainty. In the aforementioned scenario where you hear the statement carrying the meaning-uncertainty being made by somebody else, the typical reponse is to ask the statement-maker to clarify exactly what they mean (to operationalize, to use the technical term). There is of course an implicit assumption here that the statement-maker will always have a unique truth-function in their mind when they make their statement; meaning-uncertainty is a problem that exists only on the receiving end, due to imperfect linguistic encoding. If the statement-maker doesn’t have a unique truth function in mind, and they don’t care to invent one, then their statement is taken as content-free, and not engaged with.

I wonder if this is the right approach. My experience is that meaning-uncertainty exists not only on the recieving end, but also very much on the sending end too; I very often find myself saying things but not knowing quite what I would mean by them, but nevertheless feeling that they ought to be said, that making these statements does somehow contribute to the truth-seeking process. Now I could just be motivatedly deluded about the value of my utterances, but let’s run with the thought. One thing that makes me particularly inclined towards this stance is that sometimes I find myself resisting operationalizing my statements, like there’s something crucial being lost when I operationalize and restrict myself to just one truth function. If you draw the analogy with truth-uncertainty, operationalization is like just saying whether a statement is true or false, rather than giving the degree of belief. Now one of the great virtues of the Bayesian school of thought (although it would be shared by any similarly well-developed school of thought on what degrees of belief are exactly) is arguably that, by making it more clear exactly what degrees of belief are, it seems to make people a lot more comfortable with thinking about degrees of belief rather than just true vs. false, and thus dealing with truth-uncertainty. Perhaps, then, what’s needed is some sort of well-developed concept of “meaning distributions”, analogous to degrees of belief, that will allow everybody to get comfortable dealing with meaning-uncertainty. Or perhaps this analogy is a bad one; that’s a possibility.

Aside 1. Just as truth-uncertainty almost always exists to some degree, I’m fairly sure meaning-uncertainty almost always exists to some degree; operationalization is never entirely completely done. There’s a lot of meaning-uncertainty in statement (1), for example, and it doesn’t seem to completely go away no matter how much you operationalize.

Aside 2. The concept of meaning-uncertainty doesn’t seem to be as necessarily tied up with the truth-functional model to me as that of truth-uncertainty; one can imagine statements being modelled as some other sort of thing, but you’d still have to deal with exactly which example of the other sort of thing any given statement was, so there’d still be meaning-uncertainty of a sort. For example, even if you don’t see ought-statements as truth-functional, as opposed to is-statements, you can still talk about the meaning-uncertainty of an ought-statement, if not its truth-uncertainty.

Aside 3. Another way of dealing with meaning-uncertainty might be to go around the problem, and interpret statements using something other than the truth-functional method.


^ I’m inventing this word by analogy with “truth” because I get fed up with always having to decide whether to use “falsehood” or “falsity”.

Selections from Tumblr

I have a Tumblr blog which I use for writing short-form things that aren’t necessarily of any lasting value. But occasionally things do end up there that might be worth reading, so I intend to make an organized list of links to Tumblr posts that might be interesting to readers of this blog every year or so. The last time I did this was in December 2015 (here on WordPress and here on Tumblr), and I have been posting on Tumblr at a higher rate since then, so the list in this post is rather long, and I’ve organized it into subsections to make it more manageable. Only posts from December 2015 are included; for earlier posts, see the earlier lists.


Theoretical linguistics

  1. On where my interests lie in linguistics
  2. On some nonsense about “recursion”
  3. A defense of Chomsky

Phonetics and phonology

  1. On homophone-substitution errors in writing
  2. Nonce phonemes (phonemes that only appear in one morpheme) in Mako, Latin, Bintulu and Amuzgo
  3. On random word generation for conlanging
  4. On learning to pronounce foreign sounds
  5. On the alterability of idiolect phonology
  6. On liaison phenomena in languages other than French (more)
  7. On aspiration
  8. On breathy voice and voiced aspiration
  9. On word-initial sonorant-obstruent clusters
  10. On Mandarin tones

Morphology and syntax

  1. On the isolating-agglutinative-fusional cycle (more)
  2. On weather verbs
  3. On case-marking prefixes
  4. On verbal argument indexing
  5. On inflecting nouns for person (more)
  6. On marking the positive rather than the negative
  7. On overt coding of animacy


  1. On presupposition accommodation in St’át’imcets
  2. On mereologies (prompted by this)
  3. On the denotation of plural nouns (more, yet more)
  4. On affective adpositions

English phonology and grammar

  1. On grammatical gender in English
  2. On the pronunciation of television
  3. On checked vowels and yeah
  4. On the pronunciation of err
  5. On the pronunciation of one
  6. On contraction of let us
  7. On the superiority of northern English accents
  8. On final st-reduction in the Caribbean
  9. My pronunciation of anxiety
  10. On breaking before /l/
  11. On circumstances in which contraction is impossible (more, more)
  12. On double /t/ in eighteen
  13. On /k/ in length and /t/ in prince
  14. A complicated English syntax question to do with possessive NPs and number agreement
  15. On smew and English phonotactics
  16. On not all and all … not
  17. On weird pronunciations of interjections
  18. On English dual pronouns (more)
  19. My English spelling reform proposal, and various anonymous message-sender’s proposals: 1, 2, 3, amusing follow-up
  20. On reduced vowels in my accent of English
  21. On English’s one geminate consonant
  22. Tense and lax short a in American English
  23. On the most distinctive characteristic of American English
  24. A discussion on the pronunciation of English short a
  25. On the English voiced and voiceless dental fricatives

Historical linguistics

  1. On the evolution of the French reflexive
  2. On a possible Hurro-Urartian-Etruscan clade
  3. On the grammaticalization of pronouns as copulas (more)
  4. On the weirdness of the Proto-Indo-European stop system
  5. On the early development of the Indo-Aryan languages
  6. On fronting of postalveolar sibilants
  7. On fricativization of voiceless unaspirated stops (more)
  8. On regular sound change’s capability to produce apparent suppletion
  9. A discussion on various Proto-Indo-European topics
  10. On metathesis of clusters with sibilants
  11. On Lydian (more)
  12. On the conditioning of changes in vowel height
  13. On the Proto-Indo-European velar series
  14. On possible unidirectionality of OV to VO word order shifts (more, yet more)
  15. On resources on Iranian historical phonology
  16. On dl variation in Latin
  17. On the existence of Proto-Anatolian

English historical phonology

  1. On irregularity in English historical phonology
  2. On it’s and ’tis
  3. On there being no /t/ in the word moisten
  4. On the digraph wh pronounced /h/ in English


  1. On Geordie
  2. On Latin trahere
  3. English place names including OE wicga ‘beetle’ as an element
  4. On hang (erratum)
  5. On cuckold and cuckoo (more)
  6. On sibling (more)
  7. On cognacy of ‘food’ and ‘corpse’
  8. On prince
  9. Obscure English words beginning with gn- (more)
  10. On ostrich
  11. On the alleged cognacy of child with a word for ‘vulva’
  12. The meaning of race
  13. On the city Ur and the prefix ur-


  1. On language’s untidiness
  2. On the femininity and masculinity of linguistic varieties
  3. On sign languages interacting with spoken languages
  4. The lewdest language
  5. On linguistic prescriptivism


  1. The pumping lemma as a statement about integers
  2. On continued fractions
  3. A theorem on rational approximation
  4. On non-intersective adjectives in mathematical terminology
  5. On Big-O and little-o notation
  6. Why you can’t integrate 1/x from -1 to 1
  7. On bets and probabilities
  8. On Riemann-Stieltjes integration by parts
  9. On dividing by zero
  10. On two meanings of ‘the’ in mathematical writing
  11. On a binomial coefficient identity
  12. On continuous entropy
  13. On generating terrain

Sociology and anthropology

  1. On sacrifice
  2. “High gods” in various societies
  3. A defense of Jared Diamond
  4. On the Lengua’s conceptions of personality (not my own writing)
  5. On the origin of religion
  6. On agnosticism/atheism in ancient India (not my own writing)
  7. On the development of kinship systems
  8. The secret girdles of Ainu women
  9. On Dene-Yeniseian and the settlement of the Americas
  10. On lactase persistence
  11. On the Ashvamedha (more)
  12. On the Solubba
  13. On “guess culture”


  1. The curious story of the inquisitive shrew-mole and roving economist Hayden Castagno
  2. On the Orkney vole (more)
  3. Comparison of the wild mammal fauna of Britain and America (plus responses to replies: 1, 2, 3, 4)


  1. On human nature and morality
  2. On Gongsun Long’s argument that people have three ears


  1. My (wrong) prediction on the EU referendum
  2. On the horribleness of American universities
  3. On the acceptability of not voting
  4. On Labor, the Conversatives and the Literal Democrats
  5. On the House of Lords
  6. Highlights from Wikipedia’s list of incidents of grave disorder in the House of Commons
  7. On British and European identity (more)


  1. On sea silk
  2. List of phrases of the form numeral-noun in Chinese history
  3. On Robert Tombs’ The English and their History
  4. YouTube videos on historical figures or events with Coldplay’s “Viva La Vida” as the background music

Art and literature

  1. On the meaning of songs
  2. On surrealism in art
  3. On V. S. Naipaul

Personal history

  1. On religious education in the UK
  2. On my local identity
  3. Dreams: 1, 2, 3, 4, 5
  4. On memories of dreams
  5. On historical events in my lifetime
  6. An important life update
  7. What I learnt from school history classes


  1. German folktales: 1, 2 (reply 1, reply 2), 3
  2. Tang Yang’s advice to the Song Emperor
  3. On dreams (notes from a lecture)
  4. On an interestingly bad joke
  5. On “anischerality”

Some of the phonological history of English vowels, illustrated by failed rhymes in English folk songs


  • ModE = Modern English (18th century–present)
  • EModE = Early Modern English (16th–17th centuries)
  • ME = Middle English (12th–15th centuries)
  • OE = Old English (7th–11th centuries)
  • OF = Old French (9th–14th centuries)

All of this information is from the amazingly comprehensive book English Pronunciation, 1500–1700 (Volume II) by E. J. Dobson, published in 1968, which I will unfortunately have to return to the library soon.

The transcriptions of ModE pronunciations are not meant to reflect any particular accent in particular but to provide enough information to allow the pronunciation in any particular accent to be deduced given sufficient knowledge about the accent.

I use the acute accent to indicate primary stress and the grave accent to indicate secondary stress in phonetic transcriptions. I don’t like the standard IPA notation.

Oh, the holly bears a blossom
As white as the lily flower
And Mary bore sweet Jesus Christ
To be our sweet saviour
— “The Holly and the Ivy”, as sung by Shirley Collins and the Young Tradition)

In ModE flower is [fláwr], but saviour is [séjvjər]; the two words don’t rhyme. But they rhymed in EModE, because saviour was pronounced with secondary stress on its final syllable, as [séjvjə̀wr], while flower was pronounced [flə́wr].

The OF suffix -our (often spelt -or in English, as in emperor and conqueror) was pronounced /-ur/; I don’t know if it was phonetically short or long, and I don’t know whether it had any stress in OF, but it was certainly borrowed into ME as long [-ùːr] quite regularly, and regularly bore a secondary stress. In general borrowings into ME and EModE seem to have always been given a secondary stress somewhere, in a position chosen so as to minimize the number of adjacent unstressed syllables in the word. The [-ùːr] ending became [-ə̀wr] by the Great Vowel Shift in EModE, and then would have become [-àwr] in ModE, except that it (universally, as far as I know) lost its secondary stress.

English shows a consistent tendency for secondary stress to disappear over time. Native English words don’t generally have secondary stress, and you could see secondary stress as a sort of protection against the phonetic degradation brought about by English’s native vowel reduction processes, serving to prevent the word from getting too dissimilar from its foreign pronunciation too quickly. Eventually, however, the word (or really suffix, in this case, since saviour, emperor and conqueror all develop in the same way) gets fully nativized, which means loss of the secondary stress and concomitant vowel reduction. According to Dobson, words probably acquired their secondary stress-less variants more or less immediately after borrowing if they were used in ordinary speech at all, but educated speech betrays no loss of secondary stress until the 17th century (he’s speaking generally here, not just about the [-ə̀wr] suffix. Disyllabic words were quickest to lose their secondary stresses, trisyllabic words (such as saviour) a bit slower, and in words with more than three syllables secondary stress often survives to the present day (there are some dialect differences, too: the suffix -ary, as in necessary, is pronounced [-ɛ̀ri] in General American but [-əri] in RP, and often just [-ri] in more colloquial British English).

The pronunciation [-ə̀wr] is recorded as late as 1665 by Owen Price (The Vocal Organ). William Salesbury (1547–1567) spells the suffix as -wr in Welsh orthography, which could reflect a pronunciation [-ùːr] or [-ur]; the former would be the result of occasional failure of the Great Vowel Shift before final [r] as in pour, tour, while the latter would be the probable initial result of vowel reduction. John Hart (1551–1570) has [-urz] in governors. So the [-ə̀wr] pronunciation was in current use throughout the 17th century, although the reduced forms were already being used occasionally in Standard English during the 16th. Exactly when [-ə̀wr] became obsolete, I don’t know (because Dobson doesn’t cover the ModE period).

Bold General Wolfe to his men did say
Come lads and follow without delay
To yonder mountain that is so high
Don’t be down-hearted
For we’ll gain the victory
— “General Wolfe” as sung by the Copper Family

Our king went forth to Normandy
With grace and might of chivalry
The God for him wrought marvelously
Wherefore England may call and cry
— “Agincourt Carol” as sung by Maddy Prior and June Tabor

This is another case where loss of secondary stress is the culprit. The words victory, Normandy and chivalry are all borrowings of OF words ending in -ie /-i/. They would therefore have ended up having [-àj] in ModE, like cry, had it not been for the loss of the secondary stress. For the -y suffix this occurred quite early in everyday speech, already in late ME, but the secondarily stressed variants survived to be used in poetry and song for quite a while longer. Alexander Gil’s Logonomia Anglica (1619) explicitly remarks that pronouncing three-syllable, initially-stressed words ending in -y with [-ə̀j] is something that can be done in poetry but not in prose. Dobson says that apart from Gil’s, there are few mentions of this feature of poetic speech during the 17th century; we can perhaps take this an indication that it was becoming unusual to pronounce -y as [-ə̀j] even in poetry. I don’t know exactly how long the feature lasted. But General Wolfe is a folk song whose exact year of composition can be identified—1759, the date of General Wolfe’s death—so the feature seems to have been present well into the 18th century.

They’ve let him stand till midsummer day
Till he looked both pale and wan
And Barleycorn, he’s grown a beard
And so become a man
— “John Barleycorn” as sung by The Young Tradition

In ModE wan is pronounced [wɒ́n], with a different vowel from man [man]. But both of them used to have the same vowel as man; in wan the influence of the preceding [w] resulted in rounding to an o-vowel. The origins of this change are traced by Dobson to the East of England during the 15th century. There is evidence of the change from the Paston Letters (a collection of correspondence between members of the Norfolk gentry between 1422 and 1509) and the Cely Papers (a collection of correspondence between wealthy wool merchants owning estates in Essex between 1475 and 1488); the Cely Papers only exhibit the change in the word was, but the change is more extensive in the Paston Letters and in fact seems to have applied before the other labial consonants [b], [f] and [v] too for these letters’ writers.

There is no evidence of the change in Standard English until 1617, when Robert Robinson in The Art of Pronunciation notes that was, wast (as in thou wast) and what have [ɒ́] rather than [á]. The restriction of the change to unstressed function words initially, as in the Cely Papers suggests the change did indeed spread from the Eastern dialects. Later phoneticians during the 17th century record the [ɒ́] pronunciation in more and more words, but the change is not regular at this point; for example, Christopher Cooper (1687) has [ɒ́] in watch but not in wan. According to Dobson, relatively literary words such as wan and quality, not often used in everyday speech, did not reliably have [ɒ́] until the late 18th century.

Note that the change also applied after [wr] in wrath, and that words in which a velar consonant ([k], [g] or [ŋ]) followed the vowel were regular exceptions (cf. wax, wag, twang).

I’ll go down in some lonesome valley
Where no man on earth shall e’er me find
Where the pretty little small birds do change their voices
And every moment blows blusterous winds
— “The Banks of the Sweet Primroses” as sung by the Copper family

The expected ModE pronunciation of OE wind ‘wind’ would be [wájnd], resulting in homophony with find. Indeed, as far as I know, every other monosyllabic word with OE -ind has [-ájnd] in Modern English (mind, grind, bind, kind, hind, rind, …), resulting from an early ME sound change that lengthened final-syllable vowels before [nd] and various other clusters containing two voiced consonants at the same place of articulation (e.g. [-ld] as in wild).

It turns out that [wájnd] did use to be the pronunciation of wind for a long time. The OED entry for wind, written in the early 20th century, actually says that the word is still commonly taken to rhyme with [-ajnd] by “modern poets”; and Bob Copper and co. can be heard pronouncing winds as [wájndz] in their recording of “The Banks of the Sweet Primroses”. The [wínd] pronunciation reportedly became usual in Standard English only in the 17th century. It is hypothesized to be a result of backformation from the derivatives windy and windmill, in which lengthening never occurred because the [nd] cluster was not in word-final position. It is unlikely to be due to avoidance of homophony with the verb wind, because the words spent several centuries being homophonous without any issues arising.

Meeting is pleasure but parting is a grief
And an inconstant lover is worse than a thief
A thief can but rob me and take all I have
But an inconstant lover sends me to the grave
— “The Cuckoo”, as sung by Anne Briggs

As the spelling suggests, the word have used to rhyme with grave. The word was confusingly variable in form in ME, but one of its forms was [haːvə] (rhyming with grave) and another one was [havə]. The latter could have been derived from the former by vowel reduction when the word was unstressed, but this is not the only possible sources of it (e.g. another one would be analogy with the second-person singular form hast, where the a was in a closed open syllable and therefore would have been short); there does not seem to be any consistent conditioning by stress in the forms recorded by 16th- and 17th-century phoneticians, who use both forms quite often. There are some who have conditioning by stress, such as Gil, who explicitly describes [hǽːv] as the stressed form and [hav] as the unstressed form. I don’t know how long [hǽːv] (and its later forms, [hɛ́ːv], [héːv], [héjv]) remained a variant usable in Standard English, but according to the Traditional Ballad Index, “The Cuckoo” is attested no earlier than 1769.

Now the day being gone and the night coming on
Those two little babies sat under a stone
They sobbed and they sighed, they sat there and cried
Those two little babies, they laid down and died
— “Babes in the Wood” as sung by the Copper family

In EModE there was occasional shortening of stressed [ɔ́ː], so that it developed into ModE [ɒ́] rather than [ów] as normal. It is a rather irregular and mysterious process; examples of it which have survived into ModE include gone (< OE ġegān), cloth (< OE clāþ) and hot (< OE hāt). The 16th- and 17th-century phoneticians record many other words which once had variants with shortening that have not survived to the present-day, such as both, loaf, rode, broad and groat. Dobson mentions that Elisha Coles (1675–1679) “knew some variant, perhaps ŏ in stone“; the verse from “Babes in the Wood” above would be additional evidence that stone at some point by some people was pronounced as [stɒn], thus rhyming with on. As far as I know, there is no way it could have been the other way round, with on having [ɔ́ː]; the word on has always had a short vowel.

“So come riddle to me, dear mother,” he said
“Come riddle it all as one
Whether I should marry with Fair Eleanor
Or bring the brown girl home” (× 2)

“Well, the brown girl, she has riches and land
Fair Eleanor, she has none
And so I charge you do my bidding
And bring the brown girl home” (× 2)
— “Lord Thomas and Fair Eleanor” as sung by Peter Bellamy

In “Lord Thomas and Fair Eleanor”, the rhymes on the final consonant are often imperfect (although the consonants are always phonetically similar). These two verses, however, are the only ones where the vowels aren’t the same in the modern pronunciation—and there’s good reason to think they were the same once.

The words one and none are closely related. The OE word for ‘one’ was ān; the OE word for ‘none’ was nān; the OE word for ‘not’ was ne; the second is simply the result of adding the third as a prefix to the first: ‘not one’.

OE ā normally becomes ME [ɔ́ː] and then ModE [ów] in stressed syllables. If it had done that in one and none, it’d be a near-rhyme with home today, save for the difference in the final nasals’ places of articulation. Indeed, in only, which is a derivative of one with the -ly suffix added, we have [ów] in ModE. But the standard ModE pronunciations of one and none are [wʌ́n] and [nʌ́n] respectively. There are also variant forms [wɒ́n] and [nɒ́n] widespread across England. How did this happen? As usual, Dobson has answers.

The [nɒ́n] variant is the easiest one to explain, at least if we consider it in isolation from the others. It’s just the result of sporadic [ɔ́ː]-shortening before [n], as in gone (see above on the onstone rhyme). As for [nʌ́n]—well, ModE [ʌ] is the ordinary reflex of short ME [u], but there is a sporadic [úː]-shortening change in EModE besides the sporadic [ɔ́ː]-shortening one. This change is quite common and reflected in many ModE words such as blood, flood, good, book, cook, wool, although I don’t think there are any where it happens before n. So perhaps [nɔ́ːn] underwent a shift to [nóːn] somehow during the ME period, which would become [núːn] by the Great Vowel Shift. As it happens there is some evidence for such a shift in ME from occasional rhymes in ME texts, such as hoom ‘home’ with doom ‘doom’ and forsothe ‘forsooth’ with bothe ‘bothe’ in the Canterbury Tales. However, there is especially solid evidence for it in the environment after [w], in which environment most instances of ME [ɔ́ː] exhibit raising that has passed into Standard English (e.g. who < OE hwā, two < OE twā, ooze < OE wāse; woe is an exception in ModE, although it, too, is listed as a homophone of woo occasionally by Early Modern phoneticians). Note that although all these examples happen to have lost the [w], presumably by absorption into the following [úː] after the Great Vowel Shift occurred, there are words such as womb with EModE [úː] which have retained their [w], and phoneticians in the 16th and 17th centuries record pronunciations of who and two with retained [w]. So if ME [ɔ́ːn] ‘one’ somehow became [wɔ́ːn], and then raising to [wóːn] occurred due to the /w/, then this vowel would be likely to spread by analogy to its derivative [nɔ́ːn], allowing for the emergence of [wʌ́n] and [nʌ́n] in ModE. The ModE [wɒ́n] and [nɒ́n] pronunciations can be accounted for by assuming the continued existence of an un-raised [wɔ́ːn] variant in EModE alongside [wuːn].

As it happens there is a late ME tendency for [j] to be inserted before long mid front vowels and, a little less commonly, for [w] to be inserted before word-initial long mid back vowels. This glide insertion only happened in initial syllables, and usually only when the vowel was word-initial or the word began with [h]; but there are occasional examples before other consonants such as John Hart’s [mjɛ́ːn] for mean. The Hymn of the Virgin (uncertain date, 14th century), which is written in Welsh orthography and therefore more phonetically transparent than usual, evidences [j] in earth. John Hart records [j] in heal and here, besides mean, and [w] in whole (< OE hāl). 17th-century phoneticians record many instances of [j]- and [w]-insertion, giving spellings such as yer for ‘ere’, yerb for ‘herb’, wuts for ‘oats’ (this one also has shortening)—but they frequently condemn these pronunciations as “barbarous”. Christopher Cooper (1687) even mentions a pronunciation wun for ‘one’, although not without condemning it for its barbarousness. The general picture seems to be that glide insertion was widespread in dialects, and filtered into Standard English to some degree during the 16th century, but there was a strong reaction against it during the 17th century and it mostly disappeared—except, of course, in the word one, which according to Dobson the [wʌ́n] pronunciation becomes normal for around 1700. The [nʌ́n] pronunciation for ‘none’ is first recorded by William Turner in The Art of Spelling and Reading English (1710).

Finally, I should mention that sporadic [úː]-shortening is also recorded as applying to home, resulting in the pronunciation [hʌ́m]; and Turner has this pronunciation, as do many English traditional dialects. So it’s possible that the rhyme in “Lord Thomas and Fair Eleanor” is due to this change having applied to home, rather than preservation of the conservative [-ówn] forms of one and none.

Pure and mixed strategies in prediction

Consider the following very simple game: a Bernoulli trial (a trial which results in one of two possible outcomes, labelled “success” and “failure”) is carried out with success probability p. Beforehand, you are told the value of p and asked to give a definite prediction of the trial’s outcome. That is, you have to predict either success or failure; just saying “the probability of success is p” is not enough. You win if and only if you predict the correct outcome.

Here are two reasonable-sounding strategies for this game:

  1. If p > 0.5, predict success. If p < 0.5, predict failure. If p = 0.5, predict success with probability 0.5 and failure with probability 0.5.
  2. Predict success with probability p and failure with probability 1 - p.

In game-theoretic language, the difference between strategies 1 and 2 is that strategy 1 involves the use of a pure strategy if possible, i.e. one in which the choice of what to predict is made deterministically, while strategy 2 is always mixed, i.e. the choice of what to predict is made randomly.

But which is better? Note that the answer may depend on the value of p. Try to think about it for a minute before moving on to the next paragraph.

If p = 0.5, then the strategies are identical and therefore both equally good.

If p \ne 0.5, let q be the probability of the more probable outcome (i.e. p if p > 0.5 and 1 - p if p < 0.5). If the more probable outcome happens, then you win for sure under strategy 1 but you only have probability q of winning under strategy 2. If the less probable outcome happens, then you lose for sure under strategy 1 but you still have probability 1 - q of winning under strategy 2. Therefore the probability of winning is q \cdot 1 + (1 - q) \cdot 0 = q under strategy 1 and q \cdot q + (1 - q) \cdot (1 - q) = 1 - 2q(1 - q) under strategy 2. So strategy 1 is better than strategy 2 if and only if

\displaystyle q > 1 - 2q(1 - q),


\displaystyle 3q - 2q^2 - 1 > 0.

This quadratic inequality holds if and only if 0.5 < q < 1. But q is the probability of the more probable outcome, and therefore q > 0.5 for sure. Therefore, strategy 1 is always better if p \ne 0.5.

I find this result weird and a little counterintuitive when it’s stated so abstractly. It seems to me like the most natural way of obtaining a definite value from the distribution—drawing randomly from the distribution—should be the best one.

But I guess it does makes sense, if you think about it as applying to a concrete situation. For example, if you were on a jury and you thought there was a 1/1024 probability that the defendant was guilty, it would be crazy to then flip 10 coins and precommit to arguing for the defendant’s guilt if every one of them came up heads. The other jurors would think you were mad (and probably be very angry with you, if they did all come up heads).

The result has interesting implications for how people should act on their beliefs. If you believe that degrees of belief can be usefully modelled as probabilities, and you try to apply this in everyday reasoning, you will often be faced with the problem of deciding whether to act in accordance with a belief’s truth even if you only place a certain probability p on that belief being true. Should you always act in accordance with the belief if p > 0.5, or should you have probability p of acting in accordance with it at any given time? Until I wrote this post it wasn’t obvious to me, but the result in this post suggests you should do the former.

I do wonder if there is anything strategy 2 is good for, though. Comment if you have an idea!

Generating random probability distributions

I’ve been thinking about how to use a computer program to randomly generate a probability distribution of a given finite size. This has turned out to be an interesting problem.

My first idea was to generate n − 1 uniform variates in [0, 1] (where n is the desired size), sort them, add 0 to the front of the list and 1 to the back of the list, and then take the non-negative differences between the adjacent variates. In Python code:

def randpd1(size):
    variates = [random.random() for i in range(size - 1)]
    return [j - i for i, j in zip([0] + variates, variates + [1])]

My second idea was to simply generate n uniform variates in [0, 1], add them together, and take the ratio of each individual variate to the sum.

def randpd2(size):
    variates = [random.random() for i in range(size)]
    s = sum(variates)
    return [i/s for i in variates]

Both of these functions do reliably generate probability distributions, i.e. lists of non-negative real numbers (encoded as Python float objects) that sum to 1, although very large lists generated by randpd2 sometimes sum to something slightly different from 1 due to floating-point imprecision.

>>> sample1 = [randpd1(2**(i//1000 + 1)) for i in range(10000)]
>>> all(all(p >= 0 for p in pd) for pd in sample1)
>>> all(sum(pd) == 1 for pd in sample1)
>>> sample2 = [randpd2(2**(i//1000 + 1)) for i in range(10000)]
>>> all(all(p >= 0 for p in pd) for pd in sample2)
>>> all(sum(pd) == 1 for pd in sample2)
>>> all(math.isclose(sum(pd), 1) for pd in sample2)

But do they both generate a random probability distribution? In precise terms: for a given size argument, are the probability distributions of the return values randpd1(size) and randpd2(size) always uniform?

I don’t really know how to answer this question. In fact, it’s not even clear to me that there is a uniform distribution over the probability distributions of size n for every positive integer n. The problem is that the probability distributions of size n are the solutions in [0, 1]^n of the equation p_1 + p_2 + \dotsb + p_n = 1, where p_1, p_2, … and p_n are dummy variables, and therefore they comprise a set S whose dimension is n − 1 (not n). Because S is missing a dimension, continuous probability distributions over it cannot be defined in the usual way via probability density mappings on \mathbb R^n. Any such mapping would have to assign probability density 0 to every point in S, because for every such point x, there’s a whole additional dimension’s worth of points in every neighbourhood of x which are not in S. But then the integral of the probability density mapping over S would be 0, not 1, and it would not be a probability density mapping.

But perhaps you can map S onto a subset of \mathbb R^{n - 1}, and do something with a uniform distribution over the image. In any case, I’m finding thinking about this very confusing, so I’ll leave it for readers to ponder over. Given that I don’t currently know what a uniform probability distribution over the probability distributions of size n even looks like, I don’t know how to test whether one exists.

I can look at the marginal distributions of the individual items in the returned values of randpd1 and randpd2. But these marginal distributions are not straightforwardly related to the joint distribution of the list as a whole. In particular, uniformity of the joint distribution does not imply uniformity of the marginal distributions, and uniformity of the marginal distributions does not imply uniformity of the joint distribution.

But it’s still interesting to look at the marginal distributions. First of all, they allow validation of another desirable property of the two functions: the marginal distributions are the same for each item (regardless of its position in the list). I’m not going to demonstrate this here because it would be tedious, but it does look like this is the case. Therefore we can speak of “the marginal distribution” without reference to any particular item. Second, they reveal that randpd1 and randpd2 do not do exactly the same thing. The marginal distributions are different for the two functions. Let’s first look just at the case where size is 2.

>>> data1 = [randpd1(2)[0] for i in range(100000)]
>>> plt.hist(data1)


>>> data2 = [randpd2(2)[0] for i in range(100000)]
>>> plt.hist(data2)


The first plot looks like it’s been generated from a uniform distribution over [0, 1]; the second plot looks like it’s been generated from a non-uniform distribution which concentrates the probability density at 1/2. It’s easy to see why the distribution is uniform for randpd1: the function works by generating a single uniform variate p and then returning [min(p, 1 - p), max(p, 1 - p)], and given that the distribution of p is uniform the distribution of 1 - p is also uniform. The function randpd2, on the other hand, works by generating two uniform variates p + q and returning [p/(p + q), q/(p + q)]. However, I don’t know what the distribution of p/(p + q) and q/(p + q) is exactly, given that p and q are uniformly distributed. This is another thing I hope readers who know more about probability and statistics than me might be able to enlighten me on.

Here are the graphs for size 3:

>>> data1 = [randpd1(3)[0] for i in range(100000)]
>>> plt.hist(data1)


>>> data2 = [randpd2(3)[0] for i in range(100000)]
>>> plt.hist(data2)


The marginal distribution for randpd1 is no longer uniform; rather, it’s right triangular, with the right angle at the point (0, 0) and height 2. That means that, roughly speaking, a given item in the list returned by randpd1(3) is twice as likely to be close to 0 as it is to be close to 1/2, and half as likely to be close to 1 as it is to be close to 1/2.

In general, the marginal distribution for randpd1 is the distribution of the minimum of a sample of uniform variates in [0, 1] of size n − 1, where n is the value of size. This is because randpd1 works by generating such a sample, and the minimum of that sample always ends up being the first item in the returned list, and the marginal distributions of the other items are the same as the marginal distribution of the first item.

It turns out to be not too difficult to derive an exact formula for this distribution. For every x \in [0, 1], the minimum is greater than x if and only if all n − 1 variates are greater than x. Therefore the probabilities of these two events are the same. The probability of an individual variate being greater than x is 1 − x (because, given that the variate is uniformly distributed, x is the probability that the variate is less than or equal to x) and therefore, given that the variates are independent of each other, the probability of all being greater than x is (1 - x)^{n - 1}. It follows that the probability of the minimum being less than or equal to x is 1 - (1 - x)^{n - 1}. That is, the cumulative distribution mapping (CDM) f of the marginal distribution for randpd1 is given by

\displaystyle f(x) = 1 - (1 - x)^{n - 1}.

The probability distribution defined by this CDM is a well-known one called the beta distribution with parameters (1, n − 1). That’s a nice result!

The marginal distribution for randpd2, on the other hand, is similar to the one for size 2 except that the mean is now something like 1/3 rather than 1/2, and because the support is still the whole interval [0, 1] this results in a left-skewing of the distribution. Again, I don’t know how to characterize this distribution exactly. Here are the graphs for sizes 4 and 5:

>>> data = [randpd2(4)[0] for i in range(100000)]
>>> plt.hist(data)


>>> data = [randpd2(5)[0] for i in range(100000)]
>>> plt.hist(data2)


It looks like the marginal distribution generally has mean 1/n, or something close to that, for every positive integer n, while still having density approaching 0 at the left limit.

In conclusion… this post doesn’t have a conclusion, it just has a bunch of unanswered questions which I’d like to know the answers to.

  1. Is the concept of a uniform distribution over the probability distributions of size n sensical?
  2. If so, do the returned values of randpd1 and randpd2 have that distribution?
  3. If not, what distributions do they have?
  4. What’s the exact form of the marginal distribution for randpd2?
  5. Which is better: randpd1 or randpd2? Or if one isn’t clearly better than the other, what is each one best suited to being used for?
  6. Are there any other algorithms one could use to generate a random probability distribution?

Modelling communication systems

One of the classes I’m taking this term is about modelling the evolution of communication systems. Everything in the class is done via simulation, which is probably the best way to do it, and certainly necessary at the point where it starts to involve genetic algorithms and such. However, some of the earlier content in the class dealt with problems that I suspected were solvable by a purely mathematical approach, so as somebody with a maths degree I felt it necessary to rise to the challenge and try to derive the solutions mathematically. This post is my attempt to do that.

Let us begin by thinking very abstractly about a system which takes something in and gives something out. Suppose there is a finite, positive number m of things which may be taken in (possible inputs), which we shall call input 1, input 2, … and input m. Suppose likewise that there is a finite, positive number n of things which may be given out (possible outputs), which we shall call output 1, output 2, … and output n.

One way in which the behavior of such a system could be modelled is as a straightforward mapping from inputs to outputs. However, this might be too deterministic: perhaps the system doesn’t always output the same output for a given input. So let’s use a more general model, and think of the system as a mapping from inputs to probability distributions over outputs. For every pair (i, j) of integers such that 0 ≤ im and 0 ≤ jn, let pi, j denote the probability that input i is mapped to output j. The mapping as a whole is determined by the mn probabilities of the form pi, j, and therefore it can be thought of as an m-by-n matrix A:

\displaystyle \mathbf A = \left( \begin{matrix} p_{1, 1} & p_{1, 2} & \hdots & p_{1, n} \\ p_{2, 1} & p_{2, 2} & \hdots & p_{2, n} \\ \vdots & \vdots & \ddots & \vdots \\ p_{m, 1} & p_{m, 2} & \hdots & p_{m, n} \end{matrix} \right).

The rows of A correspond to the possible inputs and the columns of A correspond to the possible outputs. Probabilities are non-negative real numbers, so A is a non-negative real matrix. Also, the probabilities of mutually exclusive, exhaustive outcomes sum to 1, so the sum of each row of A is 1. This condition can be expressed as a system of linear equations:

\displaystyle \begin{aligned} p_{1, 1} &+ p_{1, 2} &+ \hdots &+ p_{1, n} &= 1 \\ p_{2, 1} &+ p_{2, 2} &+ \hdots &+ p_{2, n} &= 1 \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ p_{m, 1} &+ p_{m, 2} &+ \hdots &+ p_{m, n} &= 1. \end{aligned}

Alternatively, and more compactly, it may be expressed as the matrix equation

\displaystyle (1) \quad \mathbf A \mathbf x = \mathbf y,

where x is the n-dimensional vector whose components are all equal to 1 and y is the m-dimensional vector whose components are all equal to 1.

In general, if x is an n-dimensional vector, and we think of x as a random variable determined by the output of the system, then Ax is the vector of expected values of x conditional on each input. That is, for every integer i such that 1 ≤ im, the ith component of Ax is the expected value of x conditional on meaning i being the input to the system.

Accordingly, if we have not just one, but p n-dimensional vectors x1, x2, … and xp (where p is a positive integer), we can think of these p vectors as the columns of an n-by-p matrix B, and then we can read off all the expected values from the matrix product

\displaystyle \mathbf A \mathbf B = \mathbf A \mathbf x_1 + \mathbf A \mathbf x_2 + \dotsb + \mathbf A \mathbf x_n

like so: for every pair (i, k) of integers such that 0 ≤ im and 0 ≤ kp, the (i, k) entry of AB is the expected value of xk conditional on meaning i being the input to the system.

In the case where B happens to be another non-negative real matrix such that

\displaystyle \mathbf B \mathbf x = \mathbf y,

so that the entries of B can be interpreted as probabilities, the matrix B as a whole can be interpreted as another input-output system whose possible inputs happen to be the same as the possible outputs of A. In order to emphasize this identity, let us now call the possible outputs of A (= the possible inputs of B) the signals: signal 1, signal 2, … and signal n. The other things—the possible inputs of A, and the possible outputs of B—can be thought of as meanings. Note that there is no need at the moment for the input meanings (the possible inputs of A) to be the same as the output meanings (the possible outputs of B); we make a distinction between the input meanings and the output meanings.

Together, A and B can be thought of as comprising a “product system” which works like this: an input meaning goes into A, a signal comes out of A, the signal goes into B, and an output meaning comes out of B. For every integer k such that 0 ≤ kp, the random variable xk (the kth column of B) can now be interpreted as the probability of the product system outputting output meaning k, as a random variable whose value is determined by the signal. That is, for every integer j such that 0 ≤ jn, the jth component of xk (the (j, k) entry of B) is the probability of output meaning k coming out if the signal happens to be signal j. It follows by the law of total probability that the probability of output meaning k coming out, if i is the input meaning, is the expected value of xk conditional on i being the input meaning. Now, by what we said a couple of paragraphs above, we have that for every integer i such that 0 ≤ im, the expected value of xk conditional on i being the input meaning is the (i, k) entry of AB. So the “product system”, as a matrix, is the matrix product AB. That’s why we call it the “product system”, see? 🙂

In the case where the possible input meanings are the same as the possible output meanings and m = p, we may think about the “product system” as a communicative dyad. The speaker is A, the hearer is B. The speaker is trying to express a meaning, the input meaning, and producing a signal in order to do so, and the hearer is interpreting that signal to have some meaning, the output meaning. The output meaning the hearer understands is not necessarily the same as the input meaning the speaker was trying to express. If it is different, we may regard the communication as unsuccessful; if it is the same, we may regard the communication as successful.

The key question is: what is the probability that the communication is successful? Given the considerations above, it’s very easy to answer. If the input meaning is i, we’re just looking for the probability that output meaning i given this input meaning. That probability is simply the (i, i) entry of AB, i.e. the ith entry along AB‘s main diagonal.

What if the input meaning isn’t fixed? Then the answer will in general depend on the probability distribution over the possible input meanings. But in the simplest case, where the distribution is uniform (no input meaning is any more probable than any other), the probability of successful communication is just the mean of the input meaning-specific probabilities, that is, the sum of the main diagonal entries of AB, divided by m (the number of the main diagonal entries, i.e. the number of meanings). In linear algebra, we call the sum of the main diagonal entries of a square matrix its trace, and we denote it by tr(C) where C is the matrix. So our formula for the communication success probability p is

\displaystyle (2) \quad p = \frac {\mathrm{tr}(\mathbf A \mathbf B)} m.

If the probability distribution over the input meanings isn’t uniform, the probability of successful communication is just the weighted average of the input meaning-specific probabilities, with the weights being the respective input meaning probabilities. The general formula can therefore be written as

(3) \quad \displaystyle p = \mathrm{tr}(\mathbf A \mathbf B \mathbf D) = \mathrm{tr}(\mathbf D \mathbf A \mathbf B)

where D is the diagonal matrix of size m whose main diagonal is the probability distribution over the input meanings (i.e. for every integer i such that 0 ≤ im, the ith diagonal entry of D is the probability of input meaning i being the one the speaker tries to express). It doesn’t matter whether D is left-multiplied or right-multiplied, because the trace of the product is the same in either case. In the case where the probability distribution over the input meanings is uniform the diagonal entries of D are all equal to 1/m, i.e \mathbf D = \mathbf I_m/m, where Im is the identity matrix of size m, and therefore (3) reduces to (2).

To leave you fully convinced that this formula works, here are some simulations. The 5 graphs below were generated using a Python script which you can view on GitHub. Each one involves 3 possible meanings, 3 possible signals, randomly-generated speaker and hearer matrices and a randomly-generated probability distribution over the input meanings. If you look at the code, you’ll see that the blue line is generated by simulating communication in the obvious way, by randomly drawing an input meaning, randomly drawing a signal based on that particular input meaning, and finally randomly drawing an output meaning based on that particular signal. The position on the x-axis corresponds to the number of trials (individual simulated communicative acts) carried out so far and the position on the y-axis corresponds to the proportion of those trials involving a successful communication (one where the output meaning ended up being the same as the input meaning). For each graph, there were 10 sets of 500 trials; each individual set of trials corresponds to one of the light blue lines, while the darker blue lines gives the results averaged over those ten sets. The horizontal green line indicates the success probability as calculated by our formula. This should be close to the success proportion for a large number of trials, so we should see the blue and green lines converging on the right side of each graph. That is what we see, so the formula works.


The mathematics of surprise, part 1

I’ve split this post into two parts because it would be really long otherwise. Part 2 will be coming up later, hopefully.


Let’s think about how we might define surprisingness as a mathematical quantity.

Surprisingness is a property of events as perceived by observers. After the event occurs, the observer is surprised to a certain degree. The degree of surprisingness depends on how likely the observer thought it was that the event would occur, or to put it briefly the subjective probability of the event. In fact, to a reasonable approximation, at least, the degree of surprisingness depends only on the subjective probability of the event. The greater the subjective probability, the less the surprise.

For example, if you flip a coin and it comes up heads, this is not very surprising—the probability of this event was only 1/2, which is reasonably high. But if you flip ten coins and they all come up heads, you are much more surprised, due to the fact that the probability of ten heads in ten flips is only

\displaystyle \left( \frac 1 2 \right)^{10} = \frac 1 {1024}.

This is a much smaller quantity than 1/2, hence the large increase in surprise. Even if you are unable to do the calculation and work out that the probability of ten heads in ten flips is exactly 1/1024, you are probably aware on an intuitive level that it is a lot smaller than 1/2.

These considerations suggest surprisingness might be definable as a mapping S on [0, 1] (the set of the real numbers between 0 and 1, inclusive), such that for every p \in [0, 1] (i.e. every member p of [0, 1]), the value S(p) is the common surprisingness of the events of subjective probability p. Listed below are three properties such a mapping S should have, if it is to be in accordance with the everyday meaning of “surprise”.

Property 1. For every p \in [0, 1] (i.e. every member p of [0, 1]), the value S(p) should be a non-negative real number, or +\infty (positive infinity).

Justification. Surprisingness seems to be reasonably well-modelled as (to use statistics jargon) an interval variable: two surprisingness values can be compared to see which is less and which is greater, there is a sensical notion of “distance” between surprisingness values, etc. Therefore, it makes sense for surprisingness to be represented by a real number. Moreover, there is a natural minimum surprisingness, namely the surprisingness of an event of subjective probability 1 (i.e. which the observer was sure would happen)—such an event is not at all surprising, and it makes sense to speak of being, e.g., “twice as surprised” by one event compared to another with reference to this natural minimum. To use statistics jargon again, surprisingness is a ratio variable. Naturally, it makes sense to set the minimum value at 0.

Property 2. The mapping S should be strictly decreasing. That is, for every pair p, q of members of [0, 1] such that p < q, we should have S(q) < S(p).

Justification. As said above, events of high subjective probility are less surprising than those of low subjective probability.

Property 3. For every pair p, q of members of [0, 1], we should have S(pq) = S(p) + S(q).

Justification. Suppose A and B are independent events of subjective probabilities p and q respectively. Then, assuming subjective probabilities are assigned in a consistent way, obeying the laws of probability, the subjective probability of A \cap B (the event that both A and B occur) is pq and therefore the surprisingness of A \cap B is S(pq). But given that A and B are independent (so the observation of one does not change the observer’s subjective probability of the other), the surprisingness of A \cap B overall, i.e. S(pq), should be the same as the total surprisingness of the individual observations of A and B, i.e. S(p) + S(q).

Remarkably, these three properties determine the form of S almost exactly (up to a vertical scale factor). Feel free to skip the proof below; it is somewhat long and tedious and doesn’t use any ideas that will be used later on in the post, and I haven’t put in much effort to make it easy to follow.

Theorem 1. The mappings S on [0, 1] having properties 1–3 above are those of the form p : [0, 1] \mapsto -\log_b p (i.e. those mappings S on [0, 1] such that S(p) = - \log_b p for every p \in [0, 1]), where b is a real number greater than 1.

Proof. Suppose S is a mapping on [0, 1] having properties 1–3 above. Let f be the composition of S and x : \mathbb R \mapsto e^x. Then the domain of f is the set of the x \in \mathbb R such that 0 \le e^x \le 1, i.e. x \le 0; and for every pair x, y of non-positive real numbers, we have

\begin{aligned}  f(x, y) &= S(e^{x + y}) \\  &= S(e^x e^y) \\  &= S(e^x) + S(e^y) \\  &= f(x) + f(y).  \end{aligned}

In the case where x = y = 0, we see that f(0 + 0) = f(0) + f(0), and we also have f(0 + 0) = f(0) because 0 + 0 = 0, so combining the two we have f(0) = f(0) + f(0), from which it follows by subtraction of f(0) from both sides that 0 = f(0). Similarly, for every non-positive x \in \mathbb R and every non-negative n \in \mathbb Z (the set \mathbb Z is the set of the integers) we have f((n + 1) x) = f(nx + x) = f(nx) + f(x), and if we assume as an inductive hypothesis that f(nx) = nf(x) it follows that f((n + 1) x) = (n + 1) f(x). This proves by induction that f(nx) = nf(x).

Now, let c = f(-1). For every non-positive x \in \mathbb Q (the set \mathbb Q is the set of the rational numbers), there is a non-positive m \in \mathbb Z and a positive n \in \mathbb Z such that x = m/n, and therefore

\begin{aligned}  nf(x) &= nf \left( \frac m n \right) \\  &= f(m) \\  &= f((-m)(-1)) \\  &= (-m)f(-1) \\  &= -mc,   \end{aligned}

from which it follows that f(x) = -mc/n = -cx.

The equation f(x) = -cx in fact holds for every non-positive x \in \mathbb R, not just for x \in \mathbb Q. To prove this, first, observe that f is strictly decreasing, because for every pair x, y of non-positive real numbers such that x < y, we have e^x < e^y and therefore f(y) = S(e^y) < S(e^x) = f(x) (the mapping S being strictly decreasing by property 2). Second, suppose x is a non-positive real number and f(x) \ne -cx. Then -f(x)/c \ne x, i.e. either -f(x)/c < x (case 1) or x < -f(x)/c (case 2). Because every non-empty interval contains a rational number, it follows that there is an a \in \mathbb Q such that -f(x)/c < a < x (in case 1) or x < a < -f(x)/c (in case 2).

In both cases, we have a \le 0. This is obvious in case 1, because x \le 0. As for case 2, observe first that S is non-negative-valued and therefore -f(x) \le 0. If c is positive, it will then follow that -f(x)/c \le 0 and therefore a \le 0 (because a < -f(x)/c). To prove that c is positive, observe that S is strictly decreasing, and we have f(1) = c, i.e. S(1/e) = c, and f(0) = 0, i.e. S(1) = 0.

Because a \le 0 in both cases, we have f(a) = -ca. And by the strict decreasingness of f it follows that either f(x) < -ca (in case 1) or -ca < f(x) (in case 2). Rearranging these two inequalities gives us a < -f(x)/c (in case 1) or -f(x)/c < a (in case 2). But we already have -f(x)/c < a (in case 1) or a < -f(x)/c (in case 2), so in both cases there is a contradiction. It follows that f(x) \ne -cx cannot hold; we must have f(x) = -cx.

Finishing off, note that for every p \in [0, 1], we have S(p) = S(e^{\log p}) = f(\log p) = -c \log p. Let b = e^{1/c}; then b > 1 because c > 0, and \log b = 1/c, i.e. c = 1/(\log b), from which it follows that S(p) = -(\log p)/(\log b) = -\log_b p for every p \in [0, 1]. From this we can conclude that for every mapping S on [0, 1] such that properties 1–3 hold, we have

\displaystyle S = p : [0, 1] \mapsto -\log_b p,

where b is a real number greater than 1. \blacksquare

In the rest of this post, suppose b is a real number greater than 1. The surprisingness mapping S will be defined by

\displaystyle S = p : [0, 1] \mapsto -\log_b p.

It doesn’t really make any essential difference to the nature of S which value of b is chosen, because for every p \in [0, 1], we have S(p) = -\log_b p = -(\log p)/(\log b) (where the \log symbol with no base specified refers to the natural logarithm, the one to base e), and therefore the choice of base amounts to a choice of vertical scaling. Below are three plots of surprisingness against probability for three different bases (2, e and 10); you can see that the shape of the graph is the same for each base, but the vertical scaling is different.

Note that regardless of the choice of base, we have

\begin{aligned}  S(1) &= -\log_b 1 = -0 = 0, \\  S(0) &= -\log_b 0 = -(-\infty) = \infty.  \end{aligned}

That is, events thought sure to occur are entirely unsurprising, and events thought sure not to occur are infinitely surprising when they do happen.


\displaystyle S \left( \frac 1 b \right) = -\log_b \left( \frac 1 b \right) = -(-1) = 1.

I’m inclined to say on this basis that 2 is the most sensible choice of base, because of all the possible probabilities other than 0 and 1, the one which is most special and thus most deserving of corresponding to the unit surprisingness is probably 1/2, the probability exactly halfway between 0 and 1. However, base e is also quite convenient because it simplifies some of the formulae we’ll see later on in this post.


First, two basic definitions from probability theory (for readers who may not be familiar with them). Suppose A is a finite set.

Definition 1. The probability distributions on A are the positive real-valued mappings f on A such that

\displaystyle (1) \quad \sum_{x \in A} f(x) = 1.

The probability distributions on A can be thought of as random generators of members of A. Each member of A is generated with probability equal to the positive real number less than or equal to 1 associated with that member. The probabilities of generation for each member have to add up to 1, and this is expressed by equation (1). Note that the requirement that the probabilities add up to 1 implies the requirement that the probabilities are each less than or equal to 1, so there is no need to explicitly state the latter requirement in the definition.

Henceforth, suppose f is a probability distribution on A.

Definition 2. For every real-valued mapping \tau on A, the expected value or mean \mathrm E_f(\tau) of the transformation of f by \tau is given by the formula

\displaystyle (2) \quad \mathrm E_f(\tau) = \sum_{x \in A} f(x) \tau(x).

Let x_1, x_2, … and x_n be the members of A and let p_1 = f(x_1), p_2 = f(x_2), … and p_n = f(x_n). Then for every real-valued mapping \tau on A, the expected value \mathrm E_f(\tau) is the weighted average of \tau(x_1), \tau(x_2), … and \tau(x_n), with the weights the respective probabilities p_1, p_2, … and p_n.

If a very large number N of values are generated (independently) by f, then the average value under \tau of these values can be calculated as

\displaystyle \frac 1 N \sum_{k = 1}^n f_k \tau(x_k) = \sum_{k = 1}^n \frac {f_k} N \tau(x_k),

where f_1, f_2, … and f_n are the frequencies of the values x_1, x_2, … and x_n in the sample. Given that N is very large it is very likely that the relative frequencies f_1/N, f_2/N, … and f_n/N will be close to p_1, p_2, … and p_n, respectively, and therefore the average will be close to \mathrm E_f(\tau).

Now, the definition of the key concept of this post:

Definition 3. The expected surprisingness or entropy \mathrm H_f (that’s the Greek capital letter eta, not the Latin capital letter H) of f is \mathrm E_f(S \circ f) (here, S \circ f is the composition of S and f, i.e. x : A \mapsto S(f(x))).

The use of the word “entropy” here is closely related to the use of the same word in thermodynamics. However, I won’t go into the connection here (I don’t really know enough about thermodynamics to be able to talk intelligently about it).

Having the concept of entropy to hand gives us a fun new question we can ask about any given probability distribution: what is its entropy, or, more evocatively, how surprising is it?

By (2), we have

\displaystyle (3) \quad \mathrm H_f = \sum_{x \in A} f(x) S(f(x)).

Using the definition of S, this can also be written as

\displaystyle (4) \quad \mathrm H_f = -\sum_{x \in A} f(x) \log_b f(x).

Using logarithm properties, it can even be written as

\displaystyle \quad \mathrm H_f = \log_b \frac 1 {\prod_{x \in A} f(x)^{f(x)}}.

This shows that \mathrm H_f is the logarithm of the reciprocal of

(5) \quad \displaystyle \prod_{x \in A} f(x)^{f(x)},

which is the geometric weighted average of p_1, p_2, … and p_n, with the weights being identical to the values averaged. A geometric weighted average is similar to an ordinary weighted average, except that the values averaged are multiplied rather than added and the weights are exponents rather than factors. The product (5) can therefore be thought of as the “expected probability” of the value generated by f. To the extent that f is likely to generate one of the members of A which has a low individual probability of being generated, the product (5) is small and accordingly \mathrm H_f is large. This may help give you a slightly more concrete idea why \mathrm H_f can be thought of as the expected surprisingness of f.

Note that the value of \mathrm H_f is determined completely by the probabilities p_1, p_2, … and p_n. The values x_1, x_2, … and x_n are irrelevant in and of themselves. This is evident from the formula (3), in which the expression x only appears as a sub-expression of f(x). To understand how \mathrm H_f is determined by these probabilities, it helps to look at the graph of the mapping p : (0, 1] \mapsto p S(p), which I have plotted below. The entropy \mathrm H_f is a sum of exactly n values of this mapping.

It can be seen from the graph that the mapping has a maximum value. Using calculus, we can figure out what this maximum value is and exactly where it is attained.

Theorem 1. The maximum value attained by p : [0, 1] \mapsto p S(p) is 1/b, and this maximum is attained at the point 1/b (and nowhere else).

Proof. The derivative of p : (0, +\infty) \mapsto p (-\log_b p) is p : (0, +\infty) \mapsto -(1 + \log_b p) (by the product rule), which is positive if p < 1/b (because then \log_b p < -1), equal to 0 if p = 1/b (because then \log_b p = -1), and negative if p > 1/b (because then \log_b p > -1). Therefore p : (0, \infty) \mapsto p (-\log_b p) increases in value from the vicinity of 0 to 1/b and decreases in value from 1/b towards +\infty, which means it attains a minimum at 1/b. \blacksquare

Because \mathrm H_f is a sum of n values of p : [0, 1] \mapsto p S(p), we may conclude that the inequality

\displaystyle \mathrm H_f \le \frac n b

always holds. However, this upper bound on the value of H_f can be improved, as we’ll see below. After all, in proving that it holds we’ve made no use of equation (1).

Cross entropy

The concept of cross entropy is a useful generalization of the concept of entropy.

Suppose g another probability distribution on A. Think of f and g as two different models of the random member-of-A generator: f is the right model (or a more right model, if you don’t like the idea of one single right model), and g is an observer’s working model. The probabilities p_1, p_2, … and p_n can be thought of as the real probabilities of generation of x_1, x_2, … and x_n, respectively, while the probabilities q_1 = g(x_1), q_2 = g(x_2), … and q_n = g(x_n) can be thought of as the observer’s subjective probabilities. Although the real probabilities determine what the observer observes, the subjective probabilities are what determine how surprised the observer is by what they observe. Therefore, if the observer calculates entropy by averaging their surprisingness over a large number of observations, they will get something close to

\displaystyle \sum_{k = 1}^n p_k S(q_k),

i.e. \mathrm E_f(S \circ g). This quantity is called the cross entropy from g to f and denoted \mathrm H(f, g). Note that if f = g then \mathrm H(f, g) = \mathrm H(f) = \mathrm H(g); that’s why the concept of cross entropy is a generalization of the concept of entropy. Note also that \mathrm H(f, g) is not necessarily the same as \mathrm H(g, f).

Your intuition should tell you that \mathrm H(f, g) will always be greater than \mathrm H(f), if f \ne g. Why? Think about it: \mathrm H(f, g) is the expected surprisingness if the observer has the wrong model, \mathrm H(f, f) is the expected surprisingness if the observer has the right model. Having the right model should lead to the observer being better able to predict which member of A is generated and thus to the observer being less likely to be surprised.

It is indeed the case that

\displaystyle (6) \quad \mathrm H(f, g) > \mathrm H(f)

if \mathrm f \ne g (which further reassures us that S is a very good mathematical model of the intuitive notion of surprisingness). The inequality (6) is called Gibbs’ inequality. In order to prove it, first, observe that it may be rewritten as

\displaystyle (7) \quad \mathrm H(f, g) - \mathrm H(f) > 0.

Now, the quantity on the left-hand side of (7) has a name of its own: it’s called the Kullback-Leibler divergence from g to f and denoted \mathrm D(f, g). It measures the “penalty” in increased surprisingness the observer gets for having the wrong model; it can also be thought of as a measure of how different the probability distributions f and g are from each other, hence the name “divergence”. As for the “Kullback-Leibler” part, that’s just because mathematicians have come up with lots of different ways of measuring how different two probability distributions are from each other and Kullback and Leibler were the mathematicians who came up with this particular measure. I won’t be referring to any other such measures in this post, however, so whenever I need to refer to Kullback-Leibler divergence again I’ll just refer to it as “divergence”.

So Gibb’s inequality, reformulated, states that the divergence between two unequal probability distributions is always positive. To prove this, it’s helpful to first write out an explicit expression for \mathrm D(f, g):

\displaystyle \begin{aligned}  \mathrm D(f, g) &= \mathrm H(f, g) - \mathrm H(f) \\  &= \sum_{x \in A} f(x) S(g(x)) - \sum_{x \in A} f(x) S(f(x)) \\  &= \sum_{x \in A} f(x) (S(g(x)) - S(f(x)) \\  &= \sum_{x \in A} f(x) (\log_b f(x) - \log_b g(x)) \\  &= \sum_{x \in A} f(x) \log_b \frac {f(x)} {g(x)}.  \end{aligned}

Second, we prove a lemma.

Lemma 1. For every positive x \in \mathbb R, we have \log_b x \ge (x - 1)/(x \log b), where \log b is the natural logarithm (logarithm to base e) of b, with equality if and only if x = 1.

Proof. The inequality in question is equivalent to \log x \ge (x - 1)/x (by multiplication of both sides by \log b). The right-hand side of this inequality is equal to 1 - 1/x. Consider the mapping x : (0, +\infty) \mapsto \log x - (1 - 1/x). The derivative of this mapping is x : (0, +\infty) \mapsto 1/x - 1/x^2, which is negative if x < 1 (because then x^2 < x and therefore 1/x < 1/x^2), equal to 0 if x = 1 (because then x^2 = x) and positive if x > 1 (because then x^2 > x and therefore 1/x > 1/x^2). Therefore x : (0, +\infty) \mapsto \log x - (1 - 1/x) is strictly decreasing on (0, 1) and strictly increasing on (1, +\infty). It follows that its minimum value is attained at the point 1. And that minimum value is \log x - (1 - 1/x) = 0 - (1 - 1) = 0 - 0 = 0. \blacksquare

Using Lemma 1, we have

\displaystyle (8) \quad \log_b \frac {f(x)} {g(x)} \ge \frac {f(x)/g(x) - 1} {(f(x)/g(x)) \log b} = \frac {f(x) - g(x)} {f(x) \log b}

for every x \in A, with equality if and only if f(x)/g(x) = 1, i.e. f(x) = g(x). Given that f \ne g, there is at least one x \in A such that f(x) \ne g(x) and therefore (8) holds without equality. It follows that

\begin{aligned}  \mathrm D(f, g) &> \sum_{x \in A} f(x) \frac {f(x) - g(x)} {f(x) \log b} \\  &= \frac 1 {\log b} \sum_{x \in A} (f(x) - g(x)) \\  &= \frac 1 {\log b} \left( \sum_{x \in A} f(x) - \sum_{x \in A} g(x) \right) \\  &= \frac {1 - 1} {\log b} \\  &= 0,  \end{aligned}

which proves Gibbs’ inequality.

Gibbs’ inequality is quite powerful and useful. For example, it can be used to figure out what the maximum possible entropy is. Suppose g is such that \mathrm H(f, g) = \mathrm H(g), regardless of the value of f. Then if f \ne g, we have \mathrm H(g) > \mathrm H(f) by Gibbs’ inequality, and therefore \mathrm H(g) is the maximum entropy possible for probability distributions on A and g is the unique probability distribution on A whose entropy is \mathrm H(g). Is there a probability distribution g on A such that \mathrm H(f, g) = \mathrm H(g), regardless of the value of f? There is indeed. Let g be the uniform distribution on A, i.e. the mapping x : A \mapsto 1/n (remember, n is the number of members A has). Then

\displaystyle \sum_{x \in A} g(x) = \frac n n = 1

so g is a probability distribution, and regardless of the value of f we have

\begin{aligned}  \mathrm H(f, g) &= \sum_{x \in A} f(x) \left(-\log_b \frac 1 n \right) \\  &= \left( \sum_{x \in A} f(x) \right) \log_b n \\  &= \log_b n.  \end{aligned}

Therefore, the maximum entropy possible on A is \log_b n, and the uniform distribution on A is the probability distribution which attains this maximum.

A consequence of this is that the divergence of f from the uniform distribution on A is given by

\displaystyle \mathrm D(f, g) = \mathrm H(f, g) - \mathrm H(f) = \log_b - \mathrm H(f)

which is just the negation of \mathrm H(f) plus a constant \log_b n (well, a constant depending on the size of the set A which f is distributed over). Therefore, among probability distributions on A specifically, entropy can be thought of as a measure of divergence from the uniform distribution. Among probability distributions in general, entropy is a measure of both divergence from the uniform distribution and the size of the distributed-over set.

Philosophical implications

So far, we’ve seen that the informal concept of surprisingness can be formalized mathematically with quite a high degree of success—one might even say a surprising degree of success, which is fitting—and that’s pretty neat. But there are also some deeper philosophical issues related to all this. I’ve avoided talking about them up till now because philosophy is not really my field, but I couldn’t let them go completely unmentioned.

Suppose that you know that one of n values x_1, _2, … and x_n values (where n is a positive integer) will be generated by some probability distribution f, but you don’t know the probabilities; you have absolutely no information about the probability distribution other than the set of values it may generate. What should your subjective probability distribution be? A possible answer to this question is that you shouldn’t have one—you simply don’t know the probabilities, and that’s all that can be said. And that’s reasonable enough, especially if you don’t like the concept of subjective probability or think it’s incoherent (e.g. if you’re a strict frequentist when it comes to interpreting probability). But if you accept that subjective probabilities are things it makes sense to talk about, well, the whole point of subjective probabilities is that they represent states of knowledge under uncertainty, so there’s no reason in principle to avoid having a subjective probability distribution just because this particular state of knowledge is particularly uncertain. Of course this subjective probability distribution may change as you gather more information—think of it as your “best guess” to start with, to be improved later on.

There are some people who’d say that you can choose your initial “best guess” on an essentially arbitrary basis, according to your own personal whims. No matter which subjective probability distribution you choose to start with, it will get better and better at modelling reality as you change it as you gather more information; the initial state isn’t really important. The initial state is of interest only in that if we have some updating mechanism in mind, we’d like to be able to prove that convergence to the real probability distribution will always happen independently of the initial state.

There is another position which can be taken, however, which is that there is in fact a certain objectively best subjective probability distribution to start with. This position is associated with the marquis de Laplace (1749–1827), who wrote a classic text on probability theory, the Théorie analytique des probabilités (Laplace was also a great contributor to many other fields of mathematics, as mathematicians back then tended to be—they didn’t do specialization back then). In Laplace’s opinion, the correct distribution to start with was the uniform distribution. That is, given that we know nothing about the probabilities, assume they are all the same (after all, we have no reason to believe any one is larger than any other). This principle is called the principle of indifference.

The concept of probability as a whole can be based on the idea of the principle of indifference. The idea would be that on some level, any probability distribution is over a set of interchangeable values and therefore uniform. However, we are often interested only in whether one of a particular class of values is generated (not in which particular member of that class) and the probability of interest in that case is the sum of the probabilities of each of the values in that class, which, because the underlying distribution is uniform, can also be expressed as the ratio of the number of values in the class to the total number of values which may be generated. I don’t know how far this is how Laplace thought of probability; I don’t want to be too hasty to attribute views to a 19th-century author which might be out of context or just incorrect (after all, I can’t read French so all my information about Laplace is second-hand).

It’s not hard to argue with the principle of indifference. It’s quite difficult to think of any reasonable justification for it at all. It was indeed attacked in vehement terms by later probability theorists, such as John Venn (1834–1923) (that’s right, the Venn diagram guy). In modern times, however, it has been championed by the statistical physicist E. T. Jaynes (1922–1998), who also came up with an interesting extension of it.

In the mathematical section of this post, we saw that the uniform distribution over x_1, x_2, … and x_n was the one with the most entropy, i.e. the one which is “most surprising on average”. Therefore, a natural generalization of the principle of indifference would be to say that in any circumstance in which a subjective probability distribution must be assumed in a situation of uncertainty, one should assume whichever distribution has the most entropy while still being compatible with what is known. For example, if it is known what the mean is then the assumed distribution should be the one with the most entropy among all distributions having that specific mean. This is called the principle of maximum entropy or MaxEnt for short.

The MaxEnt principle makes sense intuitively, sort of, in a way that the principle of indifference on its own doesn’t. If you don’t know something about the probability distribution, you should expect to be surprised more often by the values it generates than if you do know that something. It’s still on fairly shaky ground, though, and I don’t know how far it is accepted by people other than Jaynes as a normative principle, as opposed to just one way in which you might choose the subjective probability distribution to start with, in competition with other ways of doing so on the basis of strictly pragmatic considerations (which seems to be how a lot of people in the practical applications side of things view it). In any case it gives us a philosophical motivation for examining the mathematical problem of finding the probability distributions that have the most entropy given particular constraints. The mathematical problem is interesting itself, but the philosophical connection makes it more interesting.


In part 2 of this post, I’m going to describe how the famous normal distribution can be characterized as a maximum entropy distribution: namely, if it’s the normal distribution with mean \mu (a real number) and standard deviation \sigma (a positive real number), then it’s the absolutely probability distribution over \mathbb R with the most entropy among all absolutely continuous probability distributions over \mathbb R with mean \mu and standard deviation \sigma. That roughly means that by MaxEnt, if you know nothing about a continuous probability distribution other than that it has a particular mean and a particular standard deviation, your best guess to start with is that it’s normal. You can understand the famous Central Limit Theorem from that perspective: as you add up independent, identically distributed random variables, the mean and variance of the distribution in question will be carried over into the sum (elementary statistical theory tells us that the mean of any sum of random variables is the sum of the individual means, and the variance of any sum of independent random variables is the sum of the individual variances), but every other distinctive property of the distribution is gradually kneaded out by the summing process, so that as the sum gets larger and larger all we can say about the probability distribution of the sum is that it has this particular mean and this particular variance. I intend to finish off part 2 with a proof of the Central Limit theorem from this perspective, although that might be a little ambitious. Before that, though, the other big thing which I need to cover in part 2 is defining entropy in the case of a continuous probability distribution—I’ve only been talking about discrete probability distributions in part 1, and it turns out the extension is not entirely a straightforward matter.