Monthly Archives: November 2016

A very simple stochastic model of diachronic change

1. The discrete process

1.1. The problem

Consider an entity (for example, a language) which may or may not have a particular property (for example, obligatory coding of grammatical number). For convenience and interpretation-neutrality, we shall say that the entity is positive if it has this property and negative if it does not have this property. Consider the entity as it changes over the course of a number of events (for example, transmissions of the language from one generation to another) in which the entity’s state (whether it is positive or negative) may or may not change. For every nonnegative integer {n}, let {X_n} represent the entity’s state after exactly {n} events have occurred, with negativity being represented by 0 and positivity being represented by 1. The initial state {X_0} is a constant parameter of the model, but the states at other times are random variable whose “success” probabilities (i.e. values of 1 under their probability mass functions) are determined by {X_0} and the other parameters of the model.

The other parameters of the model, besides {X_0}, are denoted by {p} and {q}. These represent the probabilities that an event will change the state from negative to positive or from positive to negative, respectively. They are assumed to be constant across events—this assumption can be thought of as an interpretation of the uniformitarian principle familiar from historical linguistics and other fields. I shall call a change of state from negative to positive a gain and a change of state from positive to negative a loss, so that {p} can be thought of as the gain rate per event and {q} can be thought of as the loss rate per event.

Note that the gain resp. loss probability is {p}/{q} only if the state is negative resp. positive as the event begins. If the state is already positive resp. negative as the event begins then it is impossible for a further gain resp. loss to occur and therefore the gain resp. loss probability is 0 (but the loss resp. gain probability is {q}/{p}). Thus the random variables {X_1}, {X_2}, {X_3}, … are not necessarily independent of one another.

I am aware that there’s a name for a sequence of random variables that are not necessarily independent of one another, namely “stochastic process”. However, that is about the extent of what I know about stochastic processes. I think the thing I’m talking about in this post is a very simple example of a stochastic process–an appropriate name for it would be the gain-loss process. If you know something about stochastic processes it might seem very trivial, but it was an interesting problem for me to try to figure out knowing nothing already about stochastic processes.

1.2. The solution

Suppose {n} is a nonnegative integer and consider the state {X_{n + 1}} after exactly {n + 1} events have occurred. If the entity is negative as the {(n + 1)}th event begins, the probability of gain during the {(n + 1)}th event is {p}. If the entity is positive as the {(n + 1)}th event begins, the probability of loss during the {(n + 1)}th event is {q}. Now, as the {(n + 1)}th event begins, exactly {n} events have already occurred. Therefore the probability that the entity is negative as the {(n + 1)}th event begins is {\mathrm P(X_n = 0)} and the probability that the entity is positive as the {(n + 1)}th event begins is {\mathrm P(X_n = 1)}. It follows by the law of total probability that

\displaystyle \begin{aligned} \mathrm P(X_{n + 1} = 1) &= p (1 - \mathrm P(X_n = 1)) + (1 - q) \mathrm P(X_n = 1) \\ &= p - p \mathrm P(X_n = 1) + \mathrm P(X_n = 1) - q \mathrm P(X_n = 1) \\ &= p - (p - 1 + q) \mathrm P(X_n = 1) \\ &= p + (1 - p - q) \mathrm P(X_n = 1). \end{aligned}

This recurrence relation can be solved using the highly sophisticated method of “use it to find general equations for the first few terms in the sequence, extrapolate the pattern, and confirm that the extrapolation is valid using a proof by induction”. I’ll spare you the laborious first phrase, and just show you the second and third. The solution is

\displaystyle \begin{aligned} \mathrm P(X_n = 1 | X_0 = 0) &= p \sum_{i = 0}^{n - 1} (1 - p - q)^i, \\ \mathrm P(X_n = 1 | X_0 = 1) &= 1 - q \sum_{i = 0}^{n - 1} (1 - p - q)^i. \end{aligned}

Just so you can check that this is correct, the proofs by induction for the separate cases are given below.

Case 1 ({X_0 = 0)}. Base case. The expression

\displaystyle p \sum_{i = 0}^{n - 1} (1 - p - q)^i

evaluates to 0 if {n = 0}, because the sum is empty.

Successor case. For every nonnegative integer {n} such that

\displaystyle \mathrm P(X_n = 1 | X_0 = 0) = p \sum_{i = 0}^{n - 1} (1 - p - q)^i,

we have

\displaystyle \begin{aligned} \mathrm P(X_{n + 1} = 1 | X_0 = 0) &= p + (1 - p + q) \mathrm P(X_n = 1 | X_0 = 0) \\ &= p + (1 - p - q) p \sum_{i = 0}^{n - 1} (1 - p - q)^i \\ &= p + p (1 - p - q) \sum_{i = 0}^{n - 1} (1 - p - q)^i \\ &= p \left( 1 + \sum_{i = 0}^{n - 1} (1 - p - q)^{i + 1} \right) \\ &= p \sum_{j = 0}^n (1 - p - q)^j. \end{aligned}

Case 2 ({X_0 = 1}). Base case. The expression

\displaystyle 1 - q \sum_{i = 0}^{n - 1} (1 - p - q)^i

evaluates to 1 if {n = 0}, because the sum is empty.

Successor case. For every nonnegative integer {n} such that

\displaystyle \mathrm P(X_n = 1 | X_0 = 1) = 1 - q \sum_{i = 0}^{n - 1} (1 - p - q)^i,

we have

\displaystyle \begin{aligned} \mathrm P(X_{n + 1} = 1 | X_0 = 1) &= p + (1 - p + q) \mathrm P(X_n = 1 | X_0 = 1) \\ &= p + (1 - p - q) \left( 1 - q \sum_{i = 0}^{n - 1} (1 - p - q)^i \right) \\ &= p + 1 - p - q - (1 - p - q) q \sum_{i = 0}^{n - 1} (1 - p - q)^i \\ &= 1 - q - q (1 - p - q) \sum_{i = 0}^{n - 1} (1 - p - q)^i \\ &= 1 - q \left( 1 + \sum_{i = 0}^{n - 1} (1 - p - q)^{i + 1} \right) \\ &= 1 - q \sum_{j = 0}^n (1 - p - q)^j. \end{aligned}

I don’t know if there is any way to make sense of why exactly these equations are the way they are; if you have any ideas, I’d be interested to hear your comments. There is a nice way I can see of understanding the difference between the two cases. Consider an additional gain-loss process {B} which changes in tandem with the gain-loss process {A} that we’ve been considering up till just now, so that its state is always the opposite of that of {A}. Then the gain rate of {B} is {q} (because if {B} gains, {A} loses) and the lose rate of {B} is {p} (because if {B} loses, {A} gains). And for every nonnegative integer {n}, if we let {Y_n} denote the state of {B} after exactly {n} events have occurred, then

\displaystyle \mathrm P(Y_n = 1) = 1 - \mathrm P(X_n = 1)

because {Y_n = 1} if and only if {X_n = 0}. Of course, we can also rearrange this equation as {\mathrm P(X_n = 1) = 1 - \mathrm P(Y_n = 1)}.

Now, we can use the equation for Case 1 above, but with the appropriate variable names for {B} substituted in, to see that

\displaystyle \mathrm P(Y_n = 1 | Y_0 = 0) = q \sum_{i = 0}^{n - 1} (1 - q - p)^i,

and it then follows that

\displaystyle \mathrm P(X_n = 1 | X_0 = 1) = 1 - q \sum_{i = 0}^{n - 1} (1 - p - q)^i.

Anyway, you may have noticed that the sum

\displaystyle \sum_{i = 0}^{n - 1} (1 - p - q)^i

which appears in both of the equations for {\mathrm P(X_n = 1)} is a geometric progression whose common ratio is {1 - p - q}. If {1 - p - q = 1}, then {p + q = 0} and therefore {p = q = 0} (because {p} and {q} are probabilities, and therefore non-negative). The probability {\mathrm P(X_n = 1)} is then simply constant at 0 if {X_0 = 0} (because gain is impossible) and constant at 1 if {X_0 = 1} (because loss is impossible). Outside of this very trivial case, we have {1 - p - q \ne 1}, and therefore the geometric progression may be written as a fraction as per the well-known formula:

\displaystyle \begin{aligned} \sum_{i = 0}^{n - 1} (1 - p - q)^i &= \frac {1 - (1 - p - q)^n} {1 - (1 - p - q)} \\ &= \frac {1 - (1 - p - q)^n} {p + q}. \end{aligned}

It follows that

\displaystyle \begin{aligned} \mathrm P(X_n = 1 | X_0 = 0) &= \frac {p (1 - (1 - p - q)^n)} {p + q}, \\ \mathrm P(X_n = 1 | X_0 = 1) &= 1 - \frac {q (1 - (1 - p - q)^n)} {p + q} \\ &= \frac {p + q - q (1 - (1 - p - q)^n)} {p + q} \\ &= \frac {p + q - q + q (1 - p - q)^n} {p + q} \\ &= \frac {p + q (1 - p - q)^n} {p + q}. \end{aligned}

From these equations it is easy to see the limiting behaviour of the gain-loss process as the number of events approaches {\infty}. If {1 - p - q = -1}, then {p + q = 2} and therefore {p = q = 1} (because {p} and {q} are probabilities, and therefore not greater than 1). The equations in this case reduce to

\displaystyle \begin{aligned} \mathrm P(X_n = 1 | X_0 = 0) &= \frac {1 - (-1)^n} 2, \\ \mathrm P(X_n = 1 | X_0 = 1) &= \frac {1 + (-1)^n} 2, \end{aligned}

which show that the state simply alternates deterministically back and forth between positive and negative (because {(1 - (-1)^n)/2} is 0 if {n} is even and 1 if {n} is odd and {(1 + (-1)^n)/2} is 1 if {n} is even and 0 if {n} is odd).

Otherwise, we have {|1 - p - q| < 1} and therefore

\displaystyle \lim_{n \rightarrow \infty} (1 - p - q)^n = 0.

Now the equations for {\mathrm P(X_n = 1 | X_0 = 0)} and {\mathrm P(X_n = 1 | X_0 = 1)} above are the same apart from the term in the numerator which contains {(1 - p - q)^n} as a factor, as well as another factor which is independent of {n}. Therefore, regardless of the value of {X_0},

\displaystyle \lim_{k \rightarrow \infty} \mathrm P(X_k = 1) = \frac p {p + q}.

This is a nice result: if {n} is sufficiently large, the dependence of {X_n} on {X_0}, {X_1}, … and {X_{n - 1}} is negligible and its success probability is negligibly different from {p/(p + q)}. That it is this exact quantity sort of makes sense: it’s the ratio of the gain rate to the theoretical rate of change of state in either direction that we would get if both a gain and loss could occur in a single event.

In case you like graphs, here’s a graph of the process with {X_0 = 0}, {p = 1/100}, {q = 1/50} and 500 events. The x-axis is the number of events that have occurred and the y-axis is the observed frequency, divided by 1000, of the state being positive after this number of events has occurred (for the blue line) or the probability of the state being positive according to the equations described in this post (for the green line). If you want to, you can view the Python code that I used to generate this graph (which is actually capable of simulating multiple-trait interactions, although I haven’t tried solving it in that case) on GitHub.


2. The continuous process

2.1. The problem

Let us now consider the same process, but continuous rather than discrete. That is, rather than the gains and losses occuring over the course of a discrete sequence of events, we now have a continuous interval in time, during which at any point losses and gains might occur instantaneously. The state of the process at time {t} shall be denoted {X(t)}. Although multiple gains and losses may occur during an arbitrary subinterval, we may assume for the purpose of approximation that during sufficiently short subintervals only one gain or loss, or none, may occur, and the probabilities of gain and loss are directly proportional to the length of the subinterval. Let {\lambda} be the constant of proportionality for gain and let {\mu} be the constant of proportionality for loss. These are the continuous model’s analogues of the {p} and {q} parameters in the discrete model. Note that they may be greater than 1, unlike {p} and {q}.

2.2. The solution

Suppose {t} is a non-negative real number and {n} is a positive integer. Let {\Delta t = 1/n}. The interval in time from time 0 to time {t} can be divided up into {n} subintervals of length {\Delta t}. If {\Delta t} is small enough, so that the approximating assumptions described in the previous paragraph can be made, then the subintervals can be regarded as discrete events, during each of which gain occurs with probability {\lambda \Delta t} if the state at the start point of the subinterval is negative and loss occurs with probability {\mu \Delta t} if the state at the start point of the subinterval is positive. For every positive integer {k} between 0 and {n} inclusive, let {Y_k} denote the state of this discrete approximation of the process at time {t + k \Delta t}. Then for every integer {k} between 0 and {n} (inclusive) we have

\displaystyle \begin{aligned} \mathrm P(Y_k = 1 | Y_0 = 0) &= \frac {\lambda \Delta t (1 - (1 - \lambda \Delta t - \mu \Delta t)^k)} {\lambda \Delta t + \mu \Delta t}, \\ \mathrm P(Y_k = 1 | Y_0 = 1) &= \frac {\lambda \Delta t + \mu \Delta t (1 - \lambda \Delta t - \mu \Delta t)^k} {\lambda \Delta t + \mu \Delta t}, \end{aligned}

provided {\lambda} and {\mu} are not both equal to 0 (in which case, just as in the discrete case, the state remains constant at whatever the initial state was).

Many of the {\Delta t} factors in this equation can be cancelled out, giving us

\displaystyle \begin{aligned} \mathrm P(Y_k = 1 | Y_0 = 0) &= \frac {\lambda (1 - (1 - (\lambda + \mu) \Delta t)^k)} {\lambda + \mu}, \\ \mathrm P(Y_k = 1 | Y_0 = 1) &= \frac {\lambda + \mu (1 - (\lambda + \mu) \Delta t)^k} {\lambda + \mu}. \end{aligned}

Now consider the case where {k = n} in the limit {n} approaches {\infty}. Note that {\Delta t} approaches 0 at the same time, because {\Delta t = t/n}, and therefore the limit of {(1 - (\lambda + \mu) \Delta t)^n} is not simply 0 as in the discrete case. If we rewrite the expression as

\displaystyle \left( 1 - \frac {t (\lambda + \mu)} n \right)^n

and make the substitution {n = -mt(\lambda + \mu)}, giving us

\displaystyle \left( 1 + \frac 1 m \right)^{-mt(\lambda + \mu)} = \left( \left( 1 + \frac 1 m \right)^m \right)^{-t(\lambda + \mu)},

then we see that the limit is in fact {e^{-t(\lambda + \mu)}}, an exponential function of {t}. It follows that

\displaystyle \begin{aligned} \mathrm P(X(t) = 1 | X(0) = 0) = \lim_{n \rightarrow \infty} \mathrm P(Y_n = 1 | Y_0 = 0) &= \frac {\lambda (1 - e^{-t(\lambda + \mu)})} {\lambda + \mu}, \\ \mathrm P(X(t) = 1 | X(0) = 1) = \lim_{n \rightarrow \infty} \mathrm P(Y_n = 1 | Y_0 = 1) &= \frac {\lambda + \mu e^{-t(\lambda + \mu)}} {\lambda + \mu}. \end{aligned}

This is a pretty interesting result. I initially thought that the continuous process would just have the solution {\mathrm P(X_n = 1) = \lambda/{\lambda + \mu}}, completely independent of {X_0} and {t}, based on the idea that it could be viewed as a discrete process with an infinitely large number of events within every interval of time, so that it would constantly behave like the discrete process does in the limit as the number of events approaches infinity. In fact it turns out that it still behaves like the discrete process, with the effect of the initial state never quite disappearing—although it does of course disappear in the limit as {t} approaches {\infty}, because {e^{-t(\lambda + \mu)}} approaches 0:

\displaystyle \lim_{t \rightarrow \infty} \mathrm P(X(t) = 1) = \frac {\lambda} {\lambda + \mu}.


Greenberg’s Universal 38 and its diachronic implications

This post is being written hastily and therefore may be more incomprehensible than usual.

Greenberg’s Universal 38 says:

Where there is a case system, the only case which ever has only zero allomorphs is the one which includes among its meanings that of the subject of the intransitive verb. (Greenberg, 1963, 59)

A slightly better way to put this (more clearly clarifying what the universal says about languages that code case by means of non-concatenative morphology) might be: if a language makes a distinction between nominative/absolutive and ergative/accusative case by means of concatenative morphology, then there is always at least one ergative/accusative suffix form with nonzero phonological substance. Roughly, there’s a preference for there to be an ergative/accusative affix rather than a nominative/absolutive affix (but it’s OK if there are phonologically substantive affixes for both cases, or if ergative/accusative is zero-coded in some but not all environments).

On the other hand, Greenberg’s statement of the universal makes clear a rather interesting property of it: if you’re thinking about which argument can be zero-coded in a transitive sentence, Universal 38 actually says that it depends on what happens in an intransitive sentence: the one which can be zero-coded is the one which takes the same case that arguments in intransitive sentences take. If the language is accusative, then the nominative, the agenty argument, can be zero-coded, and the accusative, the patienty argument, can’t. If the language is ergative, then the absolutive, the patienty argument, can be zero-coded, and the ergative argument, can’t. (I mean can’t as in can’t be zero-coded in all environments.)

This is a problem, perhaps, for those who think of overt coding preferences and other phenomena related to “markedness” (see Haspelmath, 2006, for a good discussion of the meaning of markedness in linguistics) as related to the semantics of the category values in question. Agenty vs. patienty is the semantic classification of the arguments, but depending on the morphosyntactic alignment of the language, it can be either the agenty or patienty arguments which are allowed to be zero-coded. This seems like a case where Haspelmath’s preferred explanation of all phenomena related to markedness—differences in usage frequency—is much more preferable, although I don’t think he mentions it in his paper (but I might have missed it—I’m not going to check, because I’m trying to not spend too long writing this post).

Anyway, one thing I wonder about this universal (and a thing it’s generally interesting to wonder about with respect to any universal) is how it’s diachronically preserved. For it’s quite easy to imagine ways in which a language could end up in a situation where it has a zero-coded nominative/absolutive due to natural changes. Let’s say it has both cases overtly coded to start with; let’s say the nominative suffix is -ak and the accusative suffix is -an. Now final -n gets lost, with compensatory nasalization, and then vowels in absolute word-final position get elided. (That’s a perfectly natural sequence of sound changes; it happened in the history of English, cf. Proto-Indo-European *yugóm > Proto-Germanic *juką > English yoke.) The language would then end up with nominative -ak and a zero-coded accusative, thus violating Universal 38. So… well, I don’t actually know how absolute Universal 38 is, perhaps it has some exceptions (though I don’t know of any), and if there are enough exceptions we might be able to just say that it’s these kinds of developments that are responsible for the exceptions. But if the exceptions are very few, then there’s probably some way in which languages which end up with zero-coded accusatives like this are hastily “corrected” to keep them in line with the universal. Otherwise we’d expect to see more exceptions. Here’s one interesting question: how would that correction happen? It could just be that a postposition gets re-accreted or something and the accusative ends up being overtly coded once again. But it could also be that subjects of intransitive sentences start not getting the -ak suffix added to them, so that you get a shift from accusative to ergative morphosyntactic alignment, with the zero-coded accusative becoming a perfectly Universal 38-condoned zero-coded absolutive. That’d be pretty cool: a shift in morphosyntactic alignment triggered simply by a coincidence of sound change. Is any such development attested? Somebody should have it happen in a conlang family.

According to Wichmann (2009), morphosyntactic alignment is a “stable” feature which might be a problem if alignment shifts can occur in the manner described above. But then again, I wonder how common overt coding of both nominative/absolutive and ergative/accusative is, actually—most Indo-European languages that mark cases have it, but I did a quick survey of some non-IE languages with case marking, both accusative (Finnish, Hungarian, Turkish, Tamil, Quechua) and ergative (Basque, Dyirbal) and they all seem to code nominative/absolutive by zero (well, Basque codes absolutive overtly in one of its declensions, but not in the other two). If it’s pretty rare for both to be overtly coded, then this correction doesn’t have to happen very often, but it would surely need to happen sometimes if Universal 38 is absolute or close to it.


Greenberg, J. H., 1963. Some universals of grammar with particular reference to the order of meaningful elements. In Greenberg, J. H. (ed.), Universals of Language, 73–113. MIT Press.

Haspelmath, M. (2006). Against markedness (and what to replace it with). Journal of linguistics, 42(01), 25–70.

Wichmann, S. & Holman, E. W. (2009). Temporal stability of linguistic typological features. Retrieved from