Tag Archives: Mathematics

Pure and mixed strategies in prediction

Consider the following very simple game: a Bernoulli trial (a trial which results in one of two possible outcomes, labelled “success” and “failure”) is carried out with success probability p. Beforehand, you are told the value of p and asked to give a definite prediction of the trial’s outcome. That is, you have to predict either success or failure; just saying “the probability of success is p” is not enough. You win if and only if you predict the correct outcome.

Here are two reasonable-sounding strategies for this game:

  1. If p > 0.5, predict success. If p < 0.5, predict failure. If p = 0.5, predict success with probability 0.5 and failure with probability 0.5.
  2. Predict success with probability p and failure with probability 1 - p.

In game-theoretic language, the difference between strategies 1 and 2 is that strategy 1 involves the use of a pure strategy if possible, i.e. one in which the choice of what to predict is made deterministically, while strategy 2 is always mixed, i.e. the choice of what to predict is made randomly.

But which is better? Note that the answer may depend on the value of p. Try to think about it for a minute before moving on to the next paragraph.


If p = 0.5, then the strategies are identical and therefore both equally good.

If p \ne 0.5, let q be the probability of the more probable outcome (i.e. p if p > 0.5 and 1 - p if p < 0.5). If the more probable outcome happens, then you win for sure under strategy 1 but you only have probability q of winning under strategy 2. If the less probable outcome happens, then you lose for sure under strategy 1 but you still have probability 1 - q of winning under strategy 2. Therefore the probability of winning is q \cdot 1 + (1 - q) \cdot 0 = q under strategy 1 and q \cdot q + (1 - q) \cdot (1 - q) = 1 - 2q(1 - q) under strategy 2. So strategy 1 is better than strategy 2 if and only if

\displaystyle q > 1 - 2q(1 - q),

i.e.

\displaystyle 3q - 2q^2 - 1 > 0.

This quadratic inequality holds if and only if 0.5 < q < 1. But q is the probability of the more probable outcome, and therefore q > 0.5 for sure. Therefore, strategy 1 is always better if p \ne 0.5.
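If you want to check the algebra empirically, here is a minimal Python sketch (my own, not part of the original post) that estimates each strategy’s winning probability by simulation; the function name and parameters are just illustrative.

    import random

    def win_probability(p, strategy, trials=100_000):
        """Estimate the probability of winning the prediction game under a strategy.

        strategy 1: predict the more probable outcome (coin flip if p == 0.5).
        strategy 2: predict success with probability p, failure with probability 1 - p.
        """
        wins = 0
        for _ in range(trials):
            outcome = random.random() < p  # True means the trial was a success
            if strategy == 1:
                if p > 0.5:
                    prediction = True
                elif p < 0.5:
                    prediction = False
                else:
                    prediction = random.random() < 0.5
            else:
                prediction = random.random() < p
            wins += int(prediction == outcome)
        return wins / trials

    for p in (0.5, 0.6, 0.9):
        q = max(p, 1 - p)
        print(p, win_probability(p, 1), win_probability(p, 2), q, 1 - 2 * q * (1 - q))

The simulated values should sit close to q for strategy 1 and to 1 - 2q(1 - q) for strategy 2.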

I find this result weird and a little counterintuitive when it’s stated so abstractly. It seems to me like the most natural way of obtaining a definite value from the distribution—drawing randomly from the distribution—should be the best one.

But I guess it does make sense, if you think about it as applying to a concrete situation. For example, if you were on a jury and you thought there was a 1/1024 probability that the defendant was guilty, it would be crazy to then flip 10 coins and precommit to arguing for the defendant’s guilt if every one of them came up heads. The other jurors would think you were mad (and probably be very angry with you, if they did all come up heads).

The result has interesting implications for how people should act on their beliefs. If you believe that degrees of belief can be usefully modelled as probabilities, and you try to apply this in everyday reasoning, you will often be faced with the problem of deciding whether to act in accordance with a belief’s truth even if you only place a certain probability p on that belief being true. Should you always act in accordance with the belief if p > 0.5, or should you have probability p of acting in accordance with it at any given time? Until I wrote this post it wasn’t obvious to me, but the result in this post suggests you should do the former.

I do wonder if there is anything strategy 2 is good for, though. Comment if you have an idea!

Modelling communication systems

One of the classes I’m taking this term is about modelling the evolution of communication systems. Everything in the class is done via simulation, which is probably the best way to do it, and certainly necessary at the point where it starts to involve genetic algorithms and such. However, some of the earlier content in the class dealt with problems that I suspected were solvable by a purely mathematical approach, so as somebody with a maths degree I felt it necessary to rise to the challenge and try to derive the solutions mathematically. This post is my attempt to do that.


Let us begin by thinking very abstractly about a system which takes something in and gives something out. Suppose there is a finite, positive number m of things which may be taken in (possible inputs), which we shall call input 1, input 2, … and input m. Suppose likewise that there is a finite, positive number n of things which may be given out (possible outputs), which we shall call output 1, output 2, … and output n.

One way in which the behaviour of such a system could be modelled is as a straightforward mapping from inputs to outputs. However, this might be too deterministic: perhaps the system doesn’t always output the same output for a given input. So let’s use a more general model, and think of the system as a mapping from inputs to probability distributions over outputs. For every pair (i, j) of integers such that 1 ≤ i ≤ m and 1 ≤ j ≤ n, let p_{i, j} denote the probability that input i is mapped to output j. The mapping as a whole is determined by the mn probabilities of the form p_{i, j}, and therefore it can be thought of as an m-by-n matrix A:

\displaystyle \mathbf A = \left( \begin{matrix} p_{1, 1} & p_{1, 2} & \hdots & p_{1, n} \\ p_{2, 1} & p_{2, 2} & \hdots & p_{2, n} \\ \vdots & \vdots & \ddots & \vdots \\ p_{m, 1} & p_{m, 2} & \hdots & p_{m, n} \end{matrix} \right).

The rows of A correspond to the possible inputs and the columns of A correspond to the possible outputs. Probabilities are non-negative real numbers, so A is a non-negative real matrix. Also, the probabilities of mutually exclusive, exhaustive outcomes sum to 1, so the sum of each row of A is 1. This condition can be expressed as a system of linear equations:

\displaystyle \begin{aligned} p_{1, 1} &+ p_{1, 2} &+ \hdots &+ p_{1, n} &= 1 \\ p_{2, 1} &+ p_{2, 2} &+ \hdots &+ p_{2, n} &= 1 \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ p_{m, 1} &+ p_{m, 2} &+ \hdots &+ p_{m, n} &= 1. \end{aligned}

Alternatively, and more compactly, it may be expressed as the matrix equation

\displaystyle (1) \quad \mathbf A \mathbf x = \mathbf y,

where x is the n-dimensional vector whose components are all equal to 1 and y is the m-dimensional vector whose components are all equal to 1.
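As a quick concrete illustration (mine, not from the original post), here is what this condition looks like in NumPy for a small randomly generated system; the variable names are just illustrative.

    import numpy as np

    rng = np.random.default_rng(0)

    m, n = 2, 3                             # 2 possible inputs, 3 possible outputs
    A = rng.random((m, n))
    A /= A.sum(axis=1, keepdims=True)       # normalise each row so it sums to 1

    x = np.ones(n)                          # n-dimensional all-ones vector
    y = np.ones(m)                          # m-dimensional all-ones vector

    print(np.allclose(A @ x, y))            # True: equation (1) holds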

In general, if x is an n-dimensional vector, and we think of x as a random variable determined by the output of the system, then Ax is the vector of expected values of x conditional on each input. That is, for every integer i such that 1 ≤ i ≤ m, the ith component of Ax is the expected value of x conditional on input i being the input to the system.

Accordingly, if we have not just one, but p n-dimensional vectors x_1, x_2, … and x_p (where p is a positive integer), we can think of these p vectors as the columns of an n-by-p matrix B, and then we can read off all the expected values from the matrix product

\displaystyle \mathbf A \mathbf B = \left( \begin{matrix} \mathbf A \mathbf x_1 & \mathbf A \mathbf x_2 & \hdots & \mathbf A \mathbf x_p \end{matrix} \right)

like so: for every pair (i, k) of integers such that 1 ≤ i ≤ m and 1 ≤ k ≤ p, the (i, k) entry of AB is the expected value of x_k conditional on input i being the input to the system.

In the case where B happens to be another non-negative real matrix such that

\displaystyle \mathbf B \mathbf x' = \mathbf y',

where x' is the p-dimensional vector whose components are all equal to 1 and y' is the n-dimensional vector whose components are all equal to 1 (so that each row of B sums to 1 and the entries of B can be interpreted as probabilities), the matrix B as a whole can be interpreted as another input-output system whose possible inputs happen to be the same as the possible outputs of A. In order to emphasize this identity, let us now call the possible outputs of A (= the possible inputs of B) the signals: signal 1, signal 2, … and signal n. The other things—the possible inputs of A, and the possible outputs of B—can be thought of as meanings. Note that there is no need at the moment for the input meanings (the possible inputs of A) to be the same as the output meanings (the possible outputs of B); we make a distinction between the input meanings and the output meanings.

Together, A and B can be thought of as comprising a “product system” which works like this: an input meaning goes into A, a signal comes out of A, the signal goes into B, and an output meaning comes out of B. For every integer k such that 1 ≤ k ≤ p, the random variable x_k (the kth column of B) can now be interpreted as the probability of the product system outputting output meaning k, as a random variable whose value is determined by the signal. That is, for every integer j such that 1 ≤ j ≤ n, the jth component of x_k (the (j, k) entry of B) is the probability of output meaning k coming out if the signal happens to be signal j. It follows by the law of total probability that the probability of output meaning k coming out, if i is the input meaning, is the expected value of x_k conditional on i being the input meaning. Now, by what we said a couple of paragraphs above, we have that for every integer i such that 1 ≤ i ≤ m, the expected value of x_k conditional on i being the input meaning is the (i, k) entry of AB. So the “product system”, as a matrix, is the matrix product AB. That’s why we call it the “product system”, see? 🙂

In the case where the possible input meanings are the same as the possible output meanings and m = p, we may think about the “product system” as a communicative dyad. The speaker is A, the hearer is B. The speaker is trying to express a meaning, the input meaning, and producing a signal in order to do so, and the hearer is interpreting that signal to have some meaning, the output meaning. The output meaning the hearer understands is not necessarily the same as the input meaning the speaker was trying to express. If it is different, we may regard the communication as unsuccessful; if it is the same, we may regard the communication as successful.

The key question is: what is the probability that the communication is successful? Given the considerations above, it’s very easy to answer. If the input meaning is i, we’re just looking for the probability of output meaning i coming out given this input meaning. That probability is simply the (i, i) entry of AB, i.e. the ith entry along AB’s main diagonal.

What if the input meaning isn’t fixed? Then the answer will in general depend on the probability distribution over the possible input meanings. But in the simplest case, where the distribution is uniform (no input meaning is any more probable than any other), the probability of successful communication is just the mean of the input meaning-specific probabilities, that is, the sum of the main diagonal entries of AB, divided by m (the number of the main diagonal entries, i.e. the number of meanings). In linear algebra, we call the sum of the main diagonal entries of a square matrix its trace, and we denote it by tr(C) where C is the matrix. So our formula for the communication success probability p is

\displaystyle (2) \quad p = \frac {\mathrm{tr}(\mathbf A \mathbf B)} m.

If the probability distribution over the input meanings isn’t uniform, the probability of successful communication is just the weighted average of the input meaning-specific probabilities, with the weights being the respective input meaning probabilities. The general formula can therefore be written as

(3) \quad \displaystyle p = \mathrm{tr}(\mathbf A \mathbf B \mathbf D) = \mathrm{tr}(\mathbf D \mathbf A \mathbf B)

where D is the diagonal matrix of size m whose main diagonal is the probability distribution over the input meanings (i.e. for every integer i such that 1 ≤ i ≤ m, the ith diagonal entry of D is the probability of input meaning i being the one the speaker tries to express). It doesn’t matter whether D is left-multiplied or right-multiplied, because the trace of the product is the same in either case. In the case where the probability distribution over the input meanings is uniform, the diagonal entries of D are all equal to 1/m, i.e. \mathbf D = \mathbf I_m/m, where I_m is the identity matrix of size m, and therefore (3) reduces to (2).

To leave you fully convinced that this formula works, here are some simulations. The 5 graphs below were generated using a Python script which you can view on GitHub. Each one involves 3 possible meanings, 3 possible signals, randomly-generated speaker and hearer matrices and a randomly-generated probability distribution over the input meanings. If you look at the code, you’ll see that the blue line is generated by simulating communication in the obvious way, by randomly drawing an input meaning, randomly drawing a signal based on that particular input meaning, and finally randomly drawing an output meaning based on that particular signal. The position on the x-axis corresponds to the number of trials (individual simulated communicative acts) carried out so far and the position on the y-axis corresponds to the proportion of those trials involving a successful communication (one where the output meaning ended up being the same as the input meaning). For each graph, there were 10 sets of 500 trials; each individual set of trials corresponds to one of the light blue lines, while the darker blue line gives the results averaged over those ten sets. The horizontal green line indicates the success probability as calculated by our formula. This should be close to the success proportion for a large number of trials, so we should see the blue and green lines converging on the right side of each graph. That is what we see, so the formula works.

[Graphs g1 to g5: simulated success proportions converging to the success probability calculated by the formula.]
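If you’d rather see a self-contained numerical check than run the full script, here is a minimal sketch (my own, separate from the script linked above) that compares formula (3) with a Monte Carlo estimate; all names are illustrative.

    import numpy as np

    rng = np.random.default_rng(1)
    m = n = 3                                    # 3 meanings, 3 signals

    def random_stochastic(rows, cols):
        M = rng.random((rows, cols))
        return M / M.sum(axis=1, keepdims=True)  # make each row sum to 1

    A = random_stochastic(m, n)                  # speaker: meanings -> signals
    B = random_stochastic(n, m)                  # hearer: signals -> meanings
    d = rng.random(m)
    d /= d.sum()                                 # distribution over input meanings
    D = np.diag(d)

    formula = np.trace(D @ A @ B)                # equation (3)

    trials = 200_000
    successes = 0
    for i in rng.choice(m, size=trials, p=d):    # draw an input meaning
        j = rng.choice(n, p=A[i])                # speaker produces a signal
        k = rng.choice(m, p=B[j])                # hearer interprets the signal
        successes += int(k == i)
    print(formula, successes / trials)           # the two numbers should be close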

Bonferroni’s inequalities and the inclusion-exclusion principle

A non-negative real- or ∞-valued mapping f on a field \mathcal F of sets is said to be finitely additive if and only if for every pair of disjoint sets A and B in \mathcal F, we have

\displaystyle f(A \cup B) = f(A) + f(B).

The most important examples of finitely additive mappings are measures, including probability measures, although not every finitely additive mapping is a measure (measures are countably additive mappings on σ-algebras, which are a special sort of field of sets; countable additivity is a stronger property than finite additivity).

From the definition it is immediately evident that finite additivity allows us to express the value under a finitely additive mapping f on a field \mathcal F of sets of the union of any two disjoint sets in \mathcal F in terms of the values under f of the individual sets. In fact, the same can be said for unions of any arity, provided the sets are pairwise disjoint. For every field \mathcal F of sets, every finitely additive mapping f on \mathcal F, every n \in \mathbb Z_0[1] and every n-tuple (A_1, A_2, …, A_n) of pairwise disjoint sets in \mathcal F, we have

\displaystyle (1) \quad f \left( \bigcup_{k = 1}^n A_k \right) = \sum_{k = 1}^n f(A_k).

This can be proven by induction. In the base case of n = 0, we have

\displaystyle \begin{aligned} f(\emptyset) &= f(\emptyset \cup \emptyset) \\ &= f(\emptyset) + f(\emptyset) \end{aligned}

which implies that 0 = f(∅). Now, suppose m \in \mathbb Z_0, (1) holds for every m-tuple of pairwise disjoint sets in \mathcal F, and (A_1, A_2, …, A_{m + 1}) is an (m + 1)-tuple of pairwise disjoint sets in \mathcal F. Then

\displaystyle \begin{aligned} f \left( \bigcup_{k = 1}^{m + 1} A_k \right) &= f \left( \left( \bigcup_{k = 1}^m A_k \right) \cup A_{m + 1} \right) \\ &= f \left( \bigcup_{k = 1}^m A_k \right) + f(A_{m + 1}) \\ &= \sum_{k = 1}^m f(A_k) + f(A_{m + 1}) \\ &= \sum_{k = 1}^{m + 1} f(A_k) \end{aligned}

which completes the proof.

But what about unions of sets in \mathcal F that are not necessarily pairwise disjoint? Can the values under f of such unions be expressed in terms of the values under f of the individual sets then? The answer is no. However, such unions’ values under f can be expressed in terms of the values under f of the individual sets and their intersections, by what is known as the inclusion-exclusion principle. For every n \in \mathbb Z_0 and every (A_1, A_2, \dotsc, A_n) \in \mathcal F^{\times n}, we have

\displaystyle (2) \quad f \left( \bigcup_{k = 1}^n A_k \right) = \sum_{\substack{S \subseteq \mathbb Z_1^n \\ S \ne \emptyset}} (-1)^{\#S + 1} f \left( \bigcap_{k \in S} A_k \right).

The sum on the right-hand side of (2) is a rather complicated one, cumbersome to write down as well as computationally expensive to compute by virtue of its large number of terms (one for every non-empty subset of \mathbb Z_1^n, and there are 2^n − 1 of those). Therefore, it is also convenient to use Bonferroni’s inequalities, which say that for every m \in \mathbb Z_0, we have

\displaystyle (3) \quad f \left( \bigcup_{k = 1}^n A_k \right) \gtreqless \sum_{\substack{S \subseteq \mathbb Z_1^n \\ 0 < \#S \le m}} (-1)^{\#S + 1} f \left( \bigcap_{k \in S} A_k \right),

with the sign \gtreqless standing for “greater than or equal to, if m is even; less than or equal to, if m is odd”. The sum on the right-hand side of (3) has terms only for the non-empty subsets of \mathbb Z_1^n with no more than m members, and there are only \sum_{j = 1}^m \binom n j of those. In particular, if m = 1 the terms are just the values under f of the individual sets A_1, A_2, … and A_n and therefore we have

\displaystyle f \left( \bigcup_{k = 1}^n A_k \right) \le \sum_{k = 1}^n f(A_k),

which is Boole’s inequality.

Note that Bonferroni’s inequalities hold when m ≥ n as well as when m < n. When m ≥ n, the sum on the RHS of (3) is exactly the same as the sum on the RHS of (2). Because there are both even and odd integers m such that m ≥ n, and any quantity which is both less than or equal to and greater than or equal to another quantity has to be equal to it, it follows that Bonferroni’s inequalities imply that the equation (2) holds and thus generalize the inclusion-exclusion principle.
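Before the proof, here is a small numerical illustration (my own, not from the original post), using the counting measure on some random finite sets as the finitely additive mapping f:

    import random
    from itertools import combinations

    random.seed(0)

    # f is the counting measure (set cardinality), which is finitely additive.
    sets = [set(random.sample(range(30), 12)) for _ in range(4)]
    n = len(sets)
    union_size = len(set().union(*sets))

    def bonferroni_rhs(m):
        """Right-hand side of (3): one term per non-empty S with #S <= m."""
        total = 0
        for size in range(1, m + 1):
            for S in combinations(range(n), size):
                intersection = set.intersection(*(sets[k] for k in S))
                total += (-1) ** (size + 1) * len(intersection)
        return total

    for m in range(n + 1):
        sign = ">=" if m % 2 == 0 else "<="
        print(m, union_size, sign, bonferroni_rhs(m))

The direction of the inequality alternates with m, and once m reaches n the two sides agree exactly, as the inclusion-exclusion principle says they should.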

In order to prove that Bonferroni’s inequalities hold, and thus prove the inclusion-exclusion principle, we can use induction. First, consider the case where m = 0. The only subset of \mathbb Z_1^n of cardinality 0 is the empty one, so in this case the sum on the right-hand side of (3) is empty and (3) therefore reduces to the statement that

\displaystyle f \left( \bigcup_{k = 1}^n A_k \right) \ge 0,

which is true because f is non-negative- or ∞-valued.

Now, suppose M \in \mathbb Z_0 and Bonferroni’s inequalities hold in the case where m = M, and consider the case where m = M + 1. We shall use another inductive proof, within the inductive proof we’re currently carrying out, to show that Bonferroni’s inequalities hold for every n \in \mathbb Z_0 when m has this particular value. In the case where n = 0, the left-hand side of (3) reduces to f(∅) and the right-hand side is the empty sum once again, because \mathbb Z_1^0 has no non-empty subsets (\mathbb Z_1^0 is the set of the integers greater than or equal to 1 and less than or equal to 0, and there are of course no such integers). Because f(∅) = 0 it follows that (3) is true, regardless of the direction of the inequality required.

As for the successor case, suppose that N \in \mathbb Z_0 and Bonferroni’s inequalities hold in the case where n = N. Consider the case where n = N + 1. It is helpful at this point to write (3) for the given values of m and n as

\displaystyle (-1)^{M + 1} (f(A) - a) \ge 0,

where

\displaystyle \begin{aligned} A &= \bigcup_{k = 1}^{N + 1} A_k, \\ a &= \sum_{\substack{S \subseteq \mathbb Z_1^{N + 1} \\ 0 < \#S \le M + 1}} (-1)^{\#S + 1} f \left( \bigcap_{k \in S} A_k \right). \end{aligned}

If we also let

\displaystyle \begin{aligned} B &= \bigcup_{k = 1}^N A_k, \\ b &= \sum_{\substack{S \subseteq \mathbb Z_1^N \\ 0 < \#S \le M + 1}} (-1)^{\#S + 1} f \left( \bigcap_{k \in S} A_k \right), \end{aligned}

then we have (−1)^{M + 1}(f(B) − b) ≥ 0 because Bonferroni’s inequalities hold in the case where n = N. Now, how are f(A) and a related to f(B) and b?

We obviously have A = B ∪ A_{N + 1}, but B and A_{N + 1} are not necessarily disjoint, so this doesn’t immediately tell us anything about the relationship of f(A) and f(B). However, A_{N + 1} is certainly disjoint from B \ A_{N + 1}, and A is the union of these two sets, so we have

(4) \quad \displaystyle f(A) = f(B \setminus A_{N + 1}) + f(A_{N + 1}).

That’s about all we can usefully say for the moment. As for a and b, well, the terms of b are a subset of those of a, so it’s quite easy to write down the difference a − b. If we manipulate that difference a little bit, we can start getting it to look like something that could occur on the right-hand side of (3).

\displaystyle \begin{aligned} a - b &= \sum_{\substack{S \subseteq \mathbb Z_1^{N + 1} \\ N + 1 \in S \\ \#S \le M + 1}} (-1)^{\#S + 1} f \left( \bigcap_{k \in S} A_k \right) \\ &= \sum_{\substack{S \subseteq \mathbb Z_1^N \\ \#S \le M}} (-1)^{\#S + 2} f \left( \left( \bigcap_{k \in S} A_k \right) \cap A_{N + 1} \right) \\ &= f(A_{N + 1}) - \sum_{\substack{S \subseteq \mathbb Z_1^N \\ 0 < \#S \le M}} (-1)^{\#S + 1} f \left( \bigcap_{k \in S} A_k \cap A_{N + 1} \right) \end{aligned}

Let

\displaystyle \begin{aligned} C &= \bigcup_{k = 1}^N (A_k \cap A_{N + 1}), \\ c &= \sum_{\substack{S \subseteq \mathbb Z_1^N \\ 0 < \#S \le M}} (-1)^{\#S + 1} f \left( \bigcap_{k \in S} A_k \cap A_{N + 1} \right). \end{aligned}

Then we have (−1)^M(f(C) − c) ≥ 0, because Bonferroni’s inequalities hold in the case where m = M, and we have a − b = f(A_{N + 1}) − c.

Now, if we use the distributivity of intersection over union we can rewrite C as B ∩ A_{N + 1}. It follows that B is the disjoint union of C and the set B \ A_{N + 1} which turned up above when we were contemplating the relationship of f(A) and f(B), and therefore we have f(B) = f(C) + f(B \ A_{N + 1}). Using this new equation we may rewrite (4) as

(4) \quad \displaystyle f(A) = f(B) - f(C) + f(A_{N + 1}),

from which it follows that f(A) − f(B) = f(A_{N + 1}) − f(C)—a nicely analogous equation to a − b = f(A_{N + 1}) − c. Finally, let us add the two quantities (−1)^{M + 1}(f(B) − b) and (−1)^M(f(C) − c), which we know to be non-negative. The sum of two non-negative quantities is non-negative also, so we have

\displaystyle \begin{aligned} 0 &\le (-1)^{M + 1} (f(B) - b) + (-1)^M (f(C) - c) \\ &= (-1)^{M + 1} (f(B) - b - (f(C) - c)) \\ &= (-1)^{M + 1} (f(B) - f(C) - (b - c)) \\ &= (-1)^{M + 1} (f(A) - f(A_{N + 1}) - (a - f(A_{N + 1}))) \\ &= (-1)^{M + 1} (f(A) - a). \end{aligned}

\blacksquare

[1] For every (a, b) \in \mathbb Z^{\times 2}, I denote \{n \in \mathbb Z : a \le n \le b\}, \{n \in \mathbb Z : a \le n\} and \{n \in \mathbb Z : n \le b\} by \mathbb Z_a^b, \mathbb Z_a and \mathbb Z^b respectively.

A very simple stochastic model of diachronic change

1. The discrete process

1.1. The problem

Consider an entity (for example, a language) which may or may not have a particular property (for example, obligatory coding of grammatical number). For convenience and interpretation-neutrality, we shall say that the entity is positive if it has this property and negative if it does not have this property. Consider the entity as it changes over the course of a number of events (for example, transmissions of the language from one generation to another) in which the entity’s state (whether it is positive or negative) may or may not change. For every nonnegative integer {n}, let {X_n} represent the entity’s state after exactly {n} events have occurred, with negativity being represented by 0 and positivity being represented by 1. The initial state {X_0} is a constant parameter of the model, but the states at other times are random variables whose “success” probabilities (i.e. the probabilities of their taking the value 1) are determined by {X_0} and the other parameters of the model.

The other parameters of the model, besides {X_0}, are denoted by {p} and {q}. These represent the probabilities that an event will change the state from negative to positive or from positive to negative, respectively. They are assumed to be constant across events—this assumption can be thought of as an interpretation of the uniformitarian principle familiar from historical linguistics and other fields. I shall call a change of state from negative to positive a gain and a change of state from positive to negative a loss, so that {p} can be thought of as the gain rate per event and {q} can be thought of as the loss rate per event.

Note that the gain (resp. loss) probability is {p} (resp. {q}) only if the state is negative (resp. positive) as the event begins. If the state is already positive (resp. negative) as the event begins then it is impossible for a further gain (resp. loss) to occur and therefore the gain (resp. loss) probability is 0 (but the loss (resp. gain) probability is {q} (resp. {p})). Thus the random variables {X_1}, {X_2}, {X_3}, … are not necessarily independent of one another.

I am aware that there’s a name for a sequence of random variables that are not necessarily independent of one another, namely “stochastic process”. However, that is about the extent of what I know about stochastic processes. I think the thing I’m talking about in this post is a very simple example of a stochastic process–an appropriate name for it would be the gain-loss process. If you know something about stochastic processes it might seem very trivial, but it was an interesting problem for me to try to figure out, knowing nothing about stochastic processes in advance.

1.2. The solution

Suppose {n} is a nonnegative integer and consider the state {X_{n + 1}} after exactly {n + 1} events have occurred. If the entity is negative as the {(n + 1)}th event begins, the probability of gain during the {(n + 1)}th event is {p}. If the entity is positive as the {(n + 1)}th event begins, the probability of loss during the {(n + 1)}th event is {q}. Now, as the {(n + 1)}th event begins, exactly {n} events have already occurred. Therefore the probability that the entity is negative as the {(n + 1)}th event begins is {\mathrm P(X_n = 0)} and the probability that the entity is positive as the {(n + 1)}th event begins is {\mathrm P(X_n = 1)}. It follows by the law of total probability that

\displaystyle \begin{aligned} \mathrm P(X_{n + 1} = 1) &= p (1 - \mathrm P(X_n = 1)) + (1 - q) \mathrm P(X_n = 1) \\ &= p - p \mathrm P(X_n = 1) + \mathrm P(X_n = 1) - q \mathrm P(X_n = 1) \\ &= p - (p - 1 + q) \mathrm P(X_n = 1) \\ &= p + (1 - p - q) \mathrm P(X_n = 1). \end{aligned}

This recurrence relation can be solved using the highly sophisticated method of “use it to find general equations for the first few terms in the sequence, extrapolate the pattern, and confirm that the extrapolation is valid using a proof by induction”. I’ll spare you the laborious first phase, and just show you the second and third. The solution is

\displaystyle \begin{aligned} \mathrm P(X_n = 1 | X_0 = 0) &= p \sum_{i = 0}^{n - 1} (1 - p - q)^i, \\ \mathrm P(X_n = 1 | X_0 = 1) &= 1 - q \sum_{i = 0}^{n - 1} (1 - p - q)^i. \end{aligned}

Just so you can check that this is correct, the proofs by induction for the separate cases are given below.

Case 1 ({X_0 = 0}). Base case. The expression

\displaystyle p \sum_{i = 0}^{n - 1} (1 - p - q)^i

evaluates to 0 if {n = 0}, because the sum is empty.

Successor case. For every nonnegative integer {n} such that

\displaystyle \mathrm P(X_n = 1 | X_0 = 0) = p \sum_{i = 0}^{n - 1} (1 - p - q)^i,

we have

\displaystyle \begin{aligned} \mathrm P(X_{n + 1} = 1 | X_0 = 0) &= p + (1 - p - q) \mathrm P(X_n = 1 | X_0 = 0) \\ &= p + (1 - p - q) p \sum_{i = 0}^{n - 1} (1 - p - q)^i \\ &= p + p (1 - p - q) \sum_{i = 0}^{n - 1} (1 - p - q)^i \\ &= p \left( 1 + \sum_{i = 0}^{n - 1} (1 - p - q)^{i + 1} \right) \\ &= p \sum_{j = 0}^n (1 - p - q)^j. \end{aligned}

Case 2 ({X_0 = 1}). Base case. The expression

\displaystyle 1 - q \sum_{i = 0}^{n - 1} (1 - p - q)^i

evaluates to 1 if {n = 0}, because the sum is empty.

Successor case. For every nonnegative integer {n} such that

\displaystyle \mathrm P(X_n = 1 | X_0 = 1) = 1 - q \sum_{i = 0}^{n - 1} (1 - p - q)^i,

we have

\displaystyle \begin{aligned} \mathrm P(X_{n + 1} = 1 | X_0 = 1) &= p + (1 - p - q) \mathrm P(X_n = 1 | X_0 = 1) \\ &= p + (1 - p - q) \left( 1 - q \sum_{i = 0}^{n - 1} (1 - p - q)^i \right) \\ &= p + 1 - p - q - (1 - p - q) q \sum_{i = 0}^{n - 1} (1 - p - q)^i \\ &= 1 - q - q (1 - p - q) \sum_{i = 0}^{n - 1} (1 - p - q)^i \\ &= 1 - q \left( 1 + \sum_{i = 0}^{n - 1} (1 - p - q)^{i + 1} \right) \\ &= 1 - q \sum_{j = 0}^n (1 - p - q)^j. \end{aligned}

I don’t know if there is any way to make sense of why exactly these equations are the way they are; if you have any ideas, I’d be interested to hear your comments. There is a nice way I can see of understanding the difference between the two cases. Consider an additional gain-loss process {B} which changes in tandem with the gain-loss process {A} that we’ve been considering up till just now, so that its state is always the opposite of that of {A}. Then the gain rate of {B} is {q} (because if {B} gains, {A} loses) and the loss rate of {B} is {p} (because if {B} loses, {A} gains). And for every nonnegative integer {n}, if we let {Y_n} denote the state of {B} after exactly {n} events have occurred, then

\displaystyle \mathrm P(Y_n = 1) = 1 - \mathrm P(X_n = 1)

because {Y_n = 1} if and only if {X_n = 0}. Of course, we can also rearrange this equation as {\mathrm P(X_n = 1) = 1 - \mathrm P(Y_n = 1)}.

Now, we can use the equation for Case 1 above, but with the appropriate variable names for {B} substituted in, to see that

\displaystyle \mathrm P(Y_n = 1 | Y_0 = 0) = q \sum_{i = 0}^{n - 1} (1 - q - p)^i,

and it then follows that

\displaystyle \mathrm P(X_n = 1 | X_0 = 1) = 1 - q \sum_{i = 0}^{n - 1} (1 - p - q)^i.

Anyway, you may have noticed that the sum

\displaystyle \sum_{i = 0}^{n - 1} (1 - p - q)^i

which appears in both of the equations for {\mathrm P(X_n = 1)} is the sum of a geometric progression whose common ratio is {1 - p - q}. If {1 - p - q = 1}, then {p + q = 0} and therefore {p = q = 0} (because {p} and {q} are probabilities, and therefore non-negative). The probability {\mathrm P(X_n = 1)} is then simply constant at 0 if {X_0 = 0} (because gain is impossible) and constant at 1 if {X_0 = 1} (because loss is impossible). Outside of this very trivial case, we have {1 - p - q \ne 1}, and therefore the sum may be written as a fraction as per the well-known formula:

\displaystyle \begin{aligned} \sum_{i = 0}^{n - 1} (1 - p - q)^i &= \frac {1 - (1 - p - q)^n} {1 - (1 - p - q)} \\ &= \frac {1 - (1 - p - q)^n} {p + q}. \end{aligned}

It follows that

\displaystyle \begin{aligned} \mathrm P(X_n = 1 | X_0 = 0) &= \frac {p (1 - (1 - p - q)^n)} {p + q}, \\ \mathrm P(X_n = 1 | X_0 = 1) &= 1 - \frac {q (1 - (1 - p - q)^n)} {p + q} \\ &= \frac {p + q - q (1 - (1 - p - q)^n)} {p + q} \\ &= \frac {p + q - q + q (1 - p - q)^n} {p + q} \\ &= \frac {p + q (1 - p - q)^n} {p + q}. \end{aligned}

From these equations it is easy to see the limiting behaviour of the gain-loss process as the number of events approaches {\infty}. If {1 - p - q = -1}, then {p + q = 2} and therefore {p = q = 1} (because {p} and {q} are probabilities, and therefore not greater than 1). The equations in this case reduce to

\displaystyle \begin{aligned} \mathrm P(X_n = 1 | X_0 = 0) &= \frac {1 - (-1)^n} 2, \\ \mathrm P(X_n = 1 | X_0 = 1) &= \frac {1 + (-1)^n} 2, \end{aligned}

which show that the state simply alternates deterministically back and forth between positive and negative (because {(1 - (-1)^n)/2} is 0 if {n} is even and 1 if {n} is odd and {(1 + (-1)^n)/2} is 1 if {n} is even and 0 if {n} is odd).

Otherwise, we have {|1 - p - q| < 1} and therefore

\displaystyle \lim_{n \rightarrow \infty} (1 - p - q)^n = 0.

Now the equations for {\mathrm P(X_n = 1 | X_0 = 0)} and {\mathrm P(X_n = 1 | X_0 = 1)} above are the same apart from the term in the numerator which contains {(1 - p - q)^n} as a factor, as well as another factor which is independent of {n}. Therefore, regardless of the value of {X_0},

\displaystyle \lim_{n \rightarrow \infty} \mathrm P(X_n = 1) = \frac p {p + q}.

This is a nice result: if {n} is sufficiently large, the dependence of {X_n} on {X_0}, {X_1}, … and {X_{n - 1}} is negligible and its success probability is negligibly different from {p/(p + q)}. That it is this exact quantity sort of makes sense: it’s the ratio of the gain rate to the theoretical rate of change of state in either direction that we would get if both a gain and loss could occur in a single event.

In case you like graphs, here’s a graph of the process with {X_0 = 0}, {p = 1/100}, {q = 1/50} and 500 events. The x-axis is the number of events that have occurred and the y-axis is the observed frequency, divided by 1000, of the state being positive after this number of events has occurred (for the blue line) or the probability of the state being positive according to the equations described in this post (for the green line). If you want to, you can view the Python code that I used to generate this graph (which is actually capable of simulating multiple-trait interactions, although I haven’t tried solving it in that case) on GitHub.

[Graph: observed frequency and calculated probability of the state being positive, plotted against the number of events.]
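Before moving on to the continuous case, here is a minimal sketch (my own, separate from the script linked above) that checks the closed-form expressions against the recurrence relation they were derived from:

    def prob_positive_recurrence(x0, p, q, n):
        """Iterate P(X_{k+1} = 1) = p + (1 - p - q) P(X_k = 1), starting from X_0 = x0."""
        prob = float(x0)
        for _ in range(n):
            prob = p + (1 - p - q) * prob
        return prob

    def prob_positive_closed_form(x0, p, q, n):
        if x0 == 0:
            return p * (1 - (1 - p - q) ** n) / (p + q)
        return (p + q * (1 - p - q) ** n) / (p + q)

    p, q = 0.01, 0.02
    for x0 in (0, 1):
        for n in (1, 10, 500):
            print(x0, n,
                  prob_positive_recurrence(x0, p, q, n),
                  prob_positive_closed_form(x0, p, q, n),
                  p / (p + q))  # the limiting value for large n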

2. The continuous process

2.1. The problem

Let us now consider the same process, but continuous rather than discrete. That is, rather than the gains and losses occurring over the course of a discrete sequence of events, we now have a continuous interval in time, during which at any point losses and gains might occur instantaneously. The state of the process at time {t} shall be denoted {X(t)}. Although multiple gains and losses may occur during an arbitrary subinterval, we may assume for the purpose of approximation that during sufficiently short subintervals only one gain or loss, or none, may occur, and the probabilities of gain and loss are directly proportional to the length of the subinterval. Let {\lambda} be the constant of proportionality for gain and let {\mu} be the constant of proportionality for loss. These are the continuous model’s analogues of the {p} and {q} parameters in the discrete model. Note that they may be greater than 1, unlike {p} and {q}.

2.2. The solution

Suppose {t} is a non-negative real number and {n} is a positive integer. Let {\Delta t = t/n}. The interval in time from time 0 to time {t} can be divided up into {n} subintervals of length {\Delta t}. If {\Delta t} is small enough, so that the approximating assumptions described in the previous paragraph can be made, then the subintervals can be regarded as discrete events, during each of which gain occurs with probability {\lambda \Delta t} if the state at the start point of the subinterval is negative and loss occurs with probability {\mu \Delta t} if the state at the start point of the subinterval is positive. For every integer {k} between 0 and {n} inclusive, let {Y_k} denote the state of this discrete approximation of the process at time {k \Delta t}. Then for every integer {k} between 0 and {n} (inclusive) we have

\displaystyle \begin{aligned} \mathrm P(Y_k = 1 | Y_0 = 0) &= \frac {\lambda \Delta t (1 - (1 - \lambda \Delta t - \mu \Delta t)^k)} {\lambda \Delta t + \mu \Delta t}, \\ \mathrm P(Y_k = 1 | Y_0 = 1) &= \frac {\lambda \Delta t + \mu \Delta t (1 - \lambda \Delta t - \mu \Delta t)^k} {\lambda \Delta t + \mu \Delta t}, \end{aligned}

provided {\lambda} and {\mu} are not both equal to 0 (in which case, just as in the discrete case, the state remains constant at whatever the initial state was).

Many of the {\Delta t} factors in this equation can be cancelled out, giving us

\displaystyle \begin{aligned} \mathrm P(Y_k = 1 | Y_0 = 0) &= \frac {\lambda (1 - (1 - (\lambda + \mu) \Delta t)^k)} {\lambda + \mu}, \\ \mathrm P(Y_k = 1 | Y_0 = 1) &= \frac {\lambda + \mu (1 - (\lambda + \mu) \Delta t)^k} {\lambda + \mu}. \end{aligned}

Now consider the case where {k = n}, in the limit as {n} approaches {\infty}. Note that {\Delta t} approaches 0 at the same time, because {\Delta t = t/n}, and therefore the limit of {(1 - (\lambda + \mu) \Delta t)^n} is not simply 0 as in the discrete case. If we rewrite the expression as

\displaystyle \left( 1 - \frac {t (\lambda + \mu)} n \right)^n

and make the substitution {n = -mt(\lambda + \mu)}, giving us

\displaystyle \left( 1 + \frac 1 m \right)^{-mt(\lambda + \mu)} = \left( \left( 1 + \frac 1 m \right)^m \right)^{-t(\lambda + \mu)},

then we see that the limit is in fact {e^{-t(\lambda + \mu)}}, an exponential function of {t}. It follows that

\displaystyle \begin{aligned} \mathrm P(X(t) = 1 | X(0) = 0) = \lim_{n \rightarrow \infty} \mathrm P(Y_n = 1 | Y_0 = 0) &= \frac {\lambda (1 - e^{-t(\lambda + \mu)})} {\lambda + \mu}, \\ \mathrm P(X(t) = 1 | X(0) = 1) = \lim_{n \rightarrow \infty} \mathrm P(Y_n = 1 | Y_0 = 1) &= \frac {\lambda + \mu e^{-t(\lambda + \mu)}} {\lambda + \mu}. \end{aligned}

This is a pretty interesting result. I initially thought that the continuous process would just have the solution {\mathrm P(X(t) = 1) = \lambda/(\lambda + \mu)}, completely independent of {X(0)} and {t}, based on the idea that it could be viewed as a discrete process with an infinitely large number of events within every interval of time, so that it would constantly behave like the discrete process does in the limit as the number of events approaches infinity. In fact it turns out that it still behaves like the discrete process, with the effect of the initial state never quite disappearing—although it does of course disappear in the limit as {t} approaches {\infty}, because {e^{-t(\lambda + \mu)}} approaches 0:

\displaystyle \lim_{t \rightarrow \infty} \mathrm P(X(t) = 1) = \frac {\lambda} {\lambda + \mu}.
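As a sanity check (again my own sketch, not from the post), we can compare these formulas with the discretised approximation they were derived from, using a large number of short subintervals:

    import math

    def prob_positive_exact(x0, lam, mu, t):
        """P(X(t) = 1 | X(0) = x0) according to the formulas above."""
        decay = math.exp(-t * (lam + mu))
        if x0 == 0:
            return lam * (1 - decay) / (lam + mu)
        return (lam + mu * decay) / (lam + mu)

    def prob_positive_discretised(x0, lam, mu, t, n=100_000):
        """Approximate the process by n discrete events, each of length t / n."""
        dt = t / n
        prob = float(x0)
        for _ in range(n):
            prob = lam * dt + (1 - (lam + mu) * dt) * prob
        return prob

    lam, mu, t = 1.5, 0.5, 2.0
    for x0 in (0, 1):
        print(x0, prob_positive_exact(x0, lam, mu, t),
              prob_positive_discretised(x0, lam, mu, t))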

Dirichlet’s approximation theorem

The definition of rational numbers is usually expressed as follows.

Definition 1 For every real number {x}, {x} is rational if and only if there are integers {p} and {q} such that {q \ne 0} and {x = p/q}.

Remark 1 For every pair of integers {p} and {q} such that {q \ne 0}, {p/q = (p'/\gcd(p, q))/(|q|/\gcd(p, q))}, where {p' = p} if {q > 0} and {p' = -p} if {q < 0}. Therefore, the definition which is the same as Definition 1 except that {q} is required to be positive and {p} is required to be coprime to {q} is equivalent to Definition 1.

However, there’s a slightly different way one can express the definition, which uses the fact that the equations {x = p/q} and {q x = p} are equivalent.

Definition 2 For every real number {x}, {x} is rational if and only if there is a nonzero integer {q} such that {q x} is an integer.

Remark 2 The definition which is the same as Definition 2 except that {q} is required to be positive and {q x} is required to be coprime to {q} is equivalent to Definition 2.

The nice thing about Definition 2 is that it immediately brings to mind the following algorithm for verifying that a real number {x} is rational: iterate through the positive integers in ascending order, and for each positive integer {q} check whether {q x} is an integer. (It’s assumed that it is easy to check whether an arbitrary real number is an integer.) If it is an integer, stop the iteration. The algorithm terminates if and only if {x} is rational. The algorithm is obviously not very useful if it is actually used by a computer to check for rationality—one obvious problem is that it cannot verify irrationality, it can only falsify it. But it is useful as a guide to thought. Mathematical questions are often easier to think about if they are understood in terms of processes, rather than in terms of relationships between static objects.
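Here is the algorithm written out as a Python sketch (mine, purely illustrative); note that it only makes sense for exact inputs such as Fraction values, since with floating-point numbers you would need a tolerance, and it has to give up after some bound because it cannot verify irrationality.

    from fractions import Fraction

    def smallest_denominator(x, max_q=10**6):
        """Return the smallest positive integer q such that q * x is an integer,
        or None if no such q <= max_q is found."""
        for q in range(1, max_q + 1):
            qx = q * x
            if qx == int(qx):
                return q
        return None

    print(smallest_denominator(Fraction(3, 7)))    # 7
    print(smallest_denominator(Fraction(22, 4)))   # 2, since 22/4 = 11/2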

In particular, there’s a natural way in which some irrational numbers can be said to be “closer to rational” than others, in terms of this algorithm. If {x} is irrational, then none of the terms in the sequence {\langle x, 2 x, 3 x, \dotsc \rangle} are integers. But how close to integers are the terms? The closer they are to integers, the closer to rational {x} can be said to be.

But how is the closeness of the integers to the terms of the sequence to be measured? There are different ways this can be done. Perhaps the most natural way to start off with is to measure it by the minimum of the distances of the terms from the closest integers to them—that is, the minimum of the set {\{|q x - p|: p \in \mathbf Z, q \in \mathbf N\}}. Of course, this minimum may not even exist—it may be possible to make {|q x - p|} arbitrarily small by choosing appropriate integers {p} and {q} such that {q > 0}. So the first question to answer is this: for which values of {x} does the minimum exist?

The answer to this question is given by Dirichlet’s approximation theorem.

Theorem 3 (Dirichlet’s approximation theorem) For every real number {x} and every positive integer {n}, there are integers {p} and {q} such that {q > 0} and

\displaystyle  |q x - p| < \frac 1 n. \ \ \ \ \ (1)

Proof: First, let us define some notation. For every real number {x}, let {[x]} denote the greatest integer less than or equal to {x} and let {\{x\}} denote {x - [x]}. Note that the inequality {0 \le \{x\} < 1} always holds.

Now, suppose {x} is a real number and {n} is a positive integer. The {n + 1} real numbers 0, {\{x\}}, {\{2 x\}}, … and {\{n x\}} are all in the half-open interval {I = [0, 1)}. This interval can be partitioned into the {n} sub-intervals {I_1 = [0, 1/n)}, {I_2 = [1/n, 2/n)}, \dots and {I_n = [1 - 1/n, 1)}, each of length {1/n}. These {n + 1} real numbers are distributed among these {n} sub-intervals, and since there are more real numbers than sub-intervals at least one of the sub-intervals contains more than one of the real numbers. That is, there are distinct integers {r} and {s} between 0 and {n} inclusive such that {\{r x\}} and {\{s x\}} are in the same sub-interval and hence {|\{r x\} - \{s x\}| < 1/n}. Or, equivalently:

\begin{array}{rcl} 1/n &>& |\{r x\} - \{s x\}| \\  &=& |(r x - [r x]) - (s x - [s x])| \\  &=& |(r - s) x - ([r x] - [s x])|, \end{array}

so if we let {q = r - s} and {p = [r x] - [s x]} we have {|q x - p| < 1/n}. And {r} and {s} can be chosen so that {r > s} and hence {q} is positive. \Box
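The proof is constructive enough to turn directly into code. Here is a Python sketch (my own, and subject to floating-point error for very large n) that finds p and q by the same pigeonhole argument:

    import math

    def dirichlet_approximation(x, n):
        """Find integers p, q with q > 0 and |q x - p| < 1/n by looking for two of
        0, {x}, {2x}, ..., {nx} that fall in the same subinterval [j/n, (j + 1)/n)."""
        seen = {}                            # subinterval index -> multiplier
        for k in range(n + 1):
            fractional = k * x - math.floor(k * x)
            j = int(fractional * n)          # which subinterval {kx} lies in
            if j in seen:
                s, r = seen[j], k            # r > s, so q = r - s is positive
                return math.floor(r * x) - math.floor(s * x), r - s
            seen[j] = k
        raise AssertionError("unreachable: the pigeonhole principle guarantees a collision")

    x = math.sqrt(2)
    for n in (10, 100, 1000):
        p, q = dirichlet_approximation(x, n)
        print(n, p, q, abs(q * x - p), 1 / n)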

Dirichlet’s approximation theorem says that for every real number {x}, {|q x - p|} can be made arbitrarily small by choosing appropriate integers {p} and {q} such that {q > 0}. Hence, if {x} is irrational, so that {|q x - p|} is never 0, the minimum of the set {\{|q x - p|: p \in \mathbf Z, q \in \mathbf N\}} does not exist.

It may not be immediately obvious from the way in which it has been presented here why Dirichlet’s approximation theorem is called an “approximation theorem”. The reason is that if the inequality {|q x - p| < 1/n} is divided through by {q} (which produces an equivalent inequality, given that {q} is positive), the result is

\displaystyle  \left| x - \frac p q \right| < \frac 1 {n q}. \ \ \ \ \ (2)

So Dirichlet’s approximation theorem can also be interpreted as saying that for every real number {x} and every positive integer {n}, it is possible to find a rational approximation {p/q} to {x} (where {p} and {q} are integers and {q > 0}) whose error is less than {1/(nq)}. In fact, this is how the theorem is usually presented. When it’s presented in this way, Dirichlet’s approximation theorem can be seen as an addendum to the fact that for every positive integer {n}, it is possible to find a rational approximation {p/q} to {x} whose error is less than {1/n}—that is, arbitrarily small rational approximations exist to every real number. (This is very easily proven—it’s really just another way of expressing the fact that the set of all rational numbers, {\mathbf Q}, is dense in the set of all real numbers, {\mathbf R}.) After obtaining that result, one might naturally think, “well, in this sense all real numbers are equally well approximable by rational numbers, but perhaps if I make the condition more strict by adding a factor of {1/q} into the quantity the error has to be less than, I can uncover some interesting differences in the rational approximability of different real numbers.” But the relevance of Dirichlet’s approximation theorem can also be understood in a more direct way, and that’s what I wanted to show with this post.

Of course putting this extra factor in doesn’t lead to the discovery of any interesting differences in the rational approximability of different real numbers. In order to get to the interesting differences, you have to add in yet another factor of {1/q}. A real number {x} is said to be well approximable if and only if for every positive integer {n}, there are integers {p} and {q} such that {q > 0} and

\displaystyle  |q x - p| < \frac 1 {n q}, \ \ \ \ \ (3)

or, equivalently,

\displaystyle  \left| x - \frac p q \right| < \frac 1 {n q^2}. \ \ \ \ \ (4)

Otherwise, {x} is said to be badly approximable. Some real numbers are well approximable, and some are badly approximable.

There is in fact a very neat characterisation of the distinction in terms of continued fractions. The real numbers that are well approximable are precisely those that have arbitrarily large terms in their continued fraction expansion. For example, {e} is well-approximable because its continued fraction expansion is

\displaystyle  [2, 1, 2, 1, 1, 4, 1, 1, 6, 1, 1, \dotsc]

(Note that the pattern only appears from the third term onwards, so it’s really {(e - 2)/(3 - e)} that has the interesting continued fraction expansion.) Every multiple of 2 appears in this continued fraction expansion, so there are arbitrarily large terms. The real numbers that are badly approximable, on the other hand, are those that have a maximum term in their continued fraction expansion. They include all quadratic irrational numbers (since those numbers have continued fraction expansions which are eventually periodic), as well as others. For example, the real number with the continued fraction expansion

\displaystyle  [1, 2, 1, 1, 2, 1, 1, 1, 2, \dotsc]

is badly approximable. This distinction is the topic of the final year project I’m currently doing for my mathematics course at university.
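If you want to see these expansions concretely, here is a small Python sketch (mine, not part of the post) that computes the first few continued fraction terms of a float; floating-point error makes the later terms unreliable, so only the first several should be trusted.

    import math

    def continued_fraction_terms(x, count=12):
        """Compute the first `count` continued fraction terms of x."""
        terms = []
        for _ in range(count):
            a = math.floor(x)
            terms.append(a)
            x -= a
            if x == 0:
                break
            x = 1 / x
        return terms

    print(continued_fraction_terms(math.e))        # starts 2, 1, 2, 1, 1, 4, 1, 1, 6, ...
    print(continued_fraction_terms(math.sqrt(2)))  # starts 1, 2, 2, 2, ... (bounded terms)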

I guess it would be possible to motivate the well approximable-badly approximable distinction in a similar way: note that a real number {x} is rational if and only if there is an integer {q} such that {q^2 x} is an integer divisible by {q}, and then go on to say that the closeness to rationality of an irrational number {x} can be judged by how close the terms of the sequence {\langle x, 4 x, 9 x, \dotsc \rangle} are to integers that are multiples of the corresponding {q}. The well approximable numbers would be those for which there exist terms of the sequence arbitrarily close to such integers. Of course, this is a lot more contrived.