# Tag Archives: probability

## Pure and mixed strategies in prediction

Consider the following very simple game: a Bernoulli trial (a trial which results in one of two possible outcomes, labelled “success” and “failure”) is carried out with success probability $p$. Beforehand, you are told the value of $p$ and asked to give a definite prediction of the trial’s outcome. That is, you have to predict either success or failure; just saying “the probability of success is $p$” is not enough. You win if and only if you predict the correct outcome.

Here are two reasonable-sounding strategies for this game:

1. If $p > 0.5$, predict success. If $p < 0.5$, predict failure. If $p = 0.5$, predict success with probability 0.5 and failure with probability 0.5.
2. Predict success with probability $p$ and failure with probability $1 - p$.

In game-theoretic language, the difference between strategies 1 and 2 is that strategy 1 involves the use of a pure strategy if possible, i.e. one in which the choice of what to predict is made deterministically, while strategy 2 is always mixed, i.e. the choice of what to predict is made randomly.

But which is better? Note that the answer may depend on the value of $p$. Try to think about it for a minute before moving on to the next paragraph.

If $p = 0.5$, then the strategies are identical and therefore both equally good.

If $p \ne 0.5$, let $q$ be the probability of the more probable outcome (i.e. $p$ if $p > 0.5$ and $1 - p$ if $p < 0.5$). If the more probable outcome happens, then you win for sure under strategy 1 but you only have probability $q$ of winning under strategy 2. If the less probable outcome happens, then you lose for sure under strategy 1 but you still have probability $1 - q$ of winning under strategy 2. Therefore the probability of winning is $q \cdot 1 + (1 - q) \cdot 0 = q$ under strategy 1 and $q \cdot q + (1 - q) \cdot (1 - q) = 1 - 2q(1 - q)$ under strategy 2. So strategy 1 is better than strategy 2 if and only if

$\displaystyle q > 1 - 2q(1 - q),$

i.e.

$\displaystyle 3q - 2q^2 - 1 > 0.$

This quadratic inequality holds if and only if $0.5 < q < 1$. But $q$ is the probability of the more probable outcome, and therefore $q > 0.5$ for sure. Therefore, strategy 1 is always better if $p \ne 0.5$.

I find this result weird and a little counterintuitive when it’s stated so abstractly. It seems to me like the most natural way of obtaining a definite value from the distribution—drawing randomly from the distribution—should be the best one.

But I guess it does makes sense, if you think about it as applying to a concrete situation. For example, if you were on a jury and you thought there was a $1/1024$ probability that the defendant was guilty, it would be crazy to then flip 10 coins and precommit to arguing for the defendant’s guilt if every one of them came up heads. The other jurors would think you were mad (and probably be very angry with you, if they did all come up heads).

The result has interesting implications for how people should act on their beliefs. If you believe that degrees of belief can be usefully modelled as probabilities, and you try to apply this in everyday reasoning, you will often be faced with the problem of deciding whether to act in accordance with a belief’s truth even if you only place a certain probability $p$ on that belief being true. Should you always act in accordance with the belief if $p > 0.5$, or should you have probability $p$ of acting in accordance with it at any given time? Until I wrote this post it wasn’t obvious to me, but the result in this post suggests you should do the former.

I do wonder if there is anything strategy 2 is good for, though. Comment if you have an idea!

## Modelling communication systems

One of the classes I’m taking this term is about modelling the evolution of communication systems. Everything in the class is done via simulation, which is probably the best way to do it, and certainly necessary at the point where it starts to involve genetic algorithms and such. However, some of the earlier content in the class dealt with problems that I suspected were solvable by a purely mathematical approach, so as somebody with a maths degree I felt it necessary to rise to the challenge and try to derive the solutions mathematically. This post is my attempt to do that.

Let us begin by thinking very abstractly about a system which takes something in and gives something out. Suppose there is a finite, positive number m of things which may be taken in (possible inputs), which we shall call input 1, input 2, … and input m. Suppose likewise that there is a finite, positive number n of things which may be given out (possible outputs), which we shall call output 1, output 2, … and output n.

One way in which the behavior of such a system could be modelled is as a straightforward mapping from inputs to outputs. However, this might be too deterministic: perhaps the system doesn’t always output the same output for a given input. So let’s use a more general model, and think of the system as a mapping from inputs to probability distributions over outputs. For every pair (i, j) of integers such that 0 ≤ im and 0 ≤ jn, let pi, j denote the probability that input i is mapped to output j. The mapping as a whole is determined by the mn probabilities of the form pi, j, and therefore it can be thought of as an m-by-n matrix A:

$\displaystyle \mathbf A = \left( \begin{matrix} p_{1, 1} & p_{1, 2} & \hdots & p_{1, n} \\ p_{2, 1} & p_{2, 2} & \hdots & p_{2, n} \\ \vdots & \vdots & \ddots & \vdots \\ p_{m, 1} & p_{m, 2} & \hdots & p_{m, n} \end{matrix} \right).$

The rows of A correspond to the possible inputs and the columns of A correspond to the possible outputs. Probabilities are non-negative real numbers, so A is a non-negative real matrix. Also, the probabilities of mutually exclusive, exhaustive outcomes sum to 1, so the sum of each row of A is 1. This condition can be expressed as a system of linear equations:

\displaystyle \begin{aligned} p_{1, 1} &+ p_{1, 2} &+ \hdots &+ p_{1, n} &= 1 \\ p_{2, 1} &+ p_{2, 2} &+ \hdots &+ p_{2, n} &= 1 \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ p_{m, 1} &+ p_{m, 2} &+ \hdots &+ p_{m, n} &= 1. \end{aligned}

Alternatively, and more compactly, it may be expressed as the matrix equation

$\displaystyle (1) \quad \mathbf A \mathbf x = \mathbf y,$

where x is the n-dimensional vector whose components are all equal to 1 and y is the m-dimensional vector whose components are all equal to 1.

In general, if x is an n-dimensional vector, and we think of x as a random variable determined by the output of the system, then Ax is the vector of expected values of x conditional on each input. That is, for every integer i such that 1 ≤ im, the ith component of Ax is the expected value of x conditional on meaning i being the input to the system.

Accordingly, if we have not just one, but p n-dimensional vectors x1, x2, … and xp (where p is a positive integer), we can think of these p vectors as the columns of an n-by-p matrix B, and then we can read off all the expected values from the matrix product

$\displaystyle \mathbf A \mathbf B = \mathbf A \mathbf x_1 + \mathbf A \mathbf x_2 + \dotsb + \mathbf A \mathbf x_n$

like so: for every pair (i, k) of integers such that 0 ≤ im and 0 ≤ kp, the (i, k) entry of AB is the expected value of xk conditional on meaning i being the input to the system.

In the case where B happens to be another non-negative real matrix such that

$\displaystyle \mathbf B \mathbf x = \mathbf y,$

so that the entries of B can be interpreted as probabilities, the matrix B as a whole can be interpreted as another input-output system whose possible inputs happen to be the same as the possible outputs of A. In order to emphasize this identity, let us now call the possible outputs of A (= the possible inputs of B) the signals: signal 1, signal 2, … and signal n. The other things—the possible inputs of A, and the possible outputs of B—can be thought of as meanings. Note that there is no need at the moment for the input meanings (the possible inputs of A) to be the same as the output meanings (the possible outputs of B); we make a distinction between the input meanings and the output meanings.

Together, A and B can be thought of as comprising a “product system” which works like this: an input meaning goes into A, a signal comes out of A, the signal goes into B, and an output meaning comes out of B. For every integer k such that 0 ≤ kp, the random variable xk (the kth column of B) can now be interpreted as the probability of the product system outputting output meaning k, as a random variable whose value is determined by the signal. That is, for every integer j such that 0 ≤ jn, the jth component of xk (the (j, k) entry of B) is the probability of output meaning k coming out if the signal happens to be signal j. It follows by the law of total probability that the probability of output meaning k coming out, if i is the input meaning, is the expected value of xk conditional on i being the input meaning. Now, by what we said a couple of paragraphs above, we have that for every integer i such that 0 ≤ im, the expected value of xk conditional on i being the input meaning is the (i, k) entry of AB. So the “product system”, as a matrix, is the matrix product AB. That’s why we call it the “product system”, see? 🙂

In the case where the possible input meanings are the same as the possible output meanings and m = p, we may think about the “product system” as a communicative dyad. The speaker is A, the hearer is B. The speaker is trying to express a meaning, the input meaning, and producing a signal in order to do so, and the hearer is interpreting that signal to have some meaning, the output meaning. The output meaning the hearer understands is not necessarily the same as the input meaning the speaker was trying to express. If it is different, we may regard the communication as unsuccessful; if it is the same, we may regard the communication as successful.

The key question is: what is the probability that the communication is successful? Given the considerations above, it’s very easy to answer. If the input meaning is i, we’re just looking for the probability that output meaning i given this input meaning. That probability is simply the (i, i) entry of AB, i.e. the ith entry along AB‘s main diagonal.

What if the input meaning isn’t fixed? Then the answer will in general depend on the probability distribution over the possible input meanings. But in the simplest case, where the distribution is uniform (no input meaning is any more probable than any other), the probability of successful communication is just the mean of the input meaning-specific probabilities, that is, the sum of the main diagonal entries of AB, divided by m (the number of the main diagonal entries, i.e. the number of meanings). In linear algebra, we call the sum of the main diagonal entries of a square matrix its trace, and we denote it by tr(C) where C is the matrix. So our formula for the communication success probability p is

$\displaystyle (2) \quad p = \frac {\mathrm{tr}(\mathbf A \mathbf B)} m.$

If the probability distribution over the input meanings isn’t uniform, the probability of successful communication is just the weighted average of the input meaning-specific probabilities, with the weights being the respective input meaning probabilities. The general formula can therefore be written as

$(3) \quad \displaystyle p = \mathrm{tr}(\mathbf A \mathbf B \mathbf D) = \mathrm{tr}(\mathbf D \mathbf A \mathbf B)$

where D is the diagonal matrix of size m whose main diagonal is the probability distribution over the input meanings (i.e. for every integer i such that 0 ≤ im, the ith diagonal entry of D is the probability of input meaning i being the one the speaker tries to express). It doesn’t matter whether D is left-multiplied or right-multiplied, because the trace of the product is the same in either case. In the case where the probability distribution over the input meanings is uniform the diagonal entries of D are all equal to $1/m$, i.e $\mathbf D = \mathbf I_m/m$, where Im is the identity matrix of size m, and therefore (3) reduces to (2).

To leave you fully convinced that this formula works, here are some simulations. The 5 graphs below were generated using a Python script which you can view on GitHub. Each one involves 3 possible meanings, 3 possible signals, randomly-generated speaker and hearer matrices and a randomly-generated probability distribution over the input meanings. If you look at the code, you’ll see that the blue line is generated by simulating communication in the obvious way, by randomly drawing an input meaning, randomly drawing a signal based on that particular input meaning, and finally randomly drawing an output meaning based on that particular signal. The position on the x-axis corresponds to the number of trials (individual simulated communicative acts) carried out so far and the position on the y-axis corresponds to the proportion of those trials involving a successful communication (one where the output meaning ended up being the same as the input meaning). For each graph, there were 10 sets of 500 trials; each individual set of trials corresponds to one of the light blue lines, while the darker blue lines gives the results averaged over those ten sets. The horizontal green line indicates the success probability as calculated by our formula. This should be close to the success proportion for a large number of trials, so we should see the blue and green lines converging on the right side of each graph. That is what we see, so the formula works.

## Bonferroni’s inequalities and the inclusion-exclusion principle

A non-negative real- or ∞-valued mapping f on a field $\mathcal F$ of sets is said to be finitely additive if and only if for every pair of disjoint sets A and B in $\mathcal F$, we have

$\displaystyle f(A \cup B) = f(A) + f(B).$

The most important examples of finitely additive mappings are measures, including probability measures, although not every finitely additive mapping is a measure (measures are mappings on σ-algebras, which are a special sort of field of sets, that are countably additive, which is a stronger property than finite additivity).

From the definition it is immediately evident that finite additivity allows us to express the value under a mapping f on a field $\mathcal F$ of sets of any binary union of pairwise disjoint sets in $\mathcal F$ in terms of the values under f of the individual sets. In fact, the same can be said for unions of any arity, provided they are pairwise disjoint. For every field $\mathcal F$ of sets, every finitely additive mapping f on $\mathcal F$, every $n \in \mathbb Z_0$[1] and every n-tuple (A1, A2, …, An) of pairwise disjoint sets in $\mathcal F$, we have

$\displaystyle (1) \quad f \left( \bigcup_{k = 1}^n A_k \right) = \sum_{k = 1}^n A_k.$

This can be proven by induction. In the base case of n = 0, we have

\displaystyle \begin{aligned} f(\emptyset) &= f(\emptyset \cup \emptyset) \\ &= f(\emptyset) + f(\emptyset) \end{aligned}

which implies that 0 = f(∅). Now, suppose $m \in \mathbb Z_0$ and (1) holds for every (m + 1)-tuple (A1, A2, …, Am + 1) of pairwise disjoint sets in $\mathcal F$. Then

\displaystyle \begin{aligned} f \left( \bigcup_{k = 1}^{m + 1} A_k \right) &= f \left( \left( \bigcup_{k = 1}^m A_k \right) \cup A_{m + 1} \right) \\ &= f \left( \bigcup_{k = 1}^m A_k \right) + f(A_{m + 1}) \\ &= \sum_{k = 1}^m f(A_k) + f(A_{m + 1}) \\ &= \sum_{k = 1}^{m + 1} f(A_k) \end{aligned}

which completes the proof.

But what about unions of sets in $\mathcal F$ that are not necessarily pairwise disjoint? Can the values under f of such unions be expressed in terms of the values under f of the individual sets then? The answer is no. However, such unions’ values under f can be expressed in terms of the values under f of the individual sets and their intersections, by what is known as the inclusion-exclusion principle. For every $n \in \mathbb Z_0$ and every $(A_1, A_2, \dotsc, A_n) \in \mathcal F^{\times n}$, we have

$\displaystyle (2) \quad f \left( \bigcup_{k = 1}^n A_k \right) = \sum_{\substack{S \subseteq \mathbb Z_1^n \\ S \ne \emptyset}} (-1)^{\#S + 1} f \left( \bigcap_{k \in S} A_k \right).$

The sum on the right-hand side of (2) is a rather complicated one, cumbersome to write down as well as computationally expensive to compute by virtue of its large number of terms (one for every non-empty subset of $\mathbb Z_1^n$, and there are 2n − 1 of those). Therefore, it is also convenient to use Bonferroni’s inequalities, which say that for every $m \in \mathbb Z_0$, we have

$\displaystyle (3) \quad f \left( \bigcup_{k = 1}^n A_k \right) \gtreqless \sum_{\substack{S \subseteq \mathbb Z_1^n \\ 0 < \#S \le m}} (-1)^{\#S + 1} f \left( \bigcap_{k \in S} n A_k \right),$

with the sign $\gtreqless$ standing for “greater than or equal to, if m is even; less than or equal to, if m is odd”. The sum on the right-hand side of (3) has terms only for every non-empty subset of $\mathbb Z_1^n$ with no more than m members, and there are only 2m − 1 of those. In particular, if m = 1 the terms are just the values under f of the individual sets A1, A2, … and An and therefore we have

$\displaystyle f \left( \bigcup_{k = 1}^n A_k \right) \le \sum_{k = 1}^n f(A_k),$

which is Boole’s inequality.

Note that Bonferroni’s inequalities hold when mn as well as when m < n. When mn, the sum on the RHS of (3) is exactly the same as the sum on the RHs of (2). Because there are both even and odd integers m such that mn, and any quantity which is both less than or equal to and greater than or equal to another quantity has to be equal to it, it follows that Bonferroni’s inequalities imply that the equation (2) holds and thus generalize the inclusion-exclusion principle.

In order to prove that Bonferroni’s inequalities hold, and thus prove the inclusion-exclusion principle, we can use induction. First, consider the case where m = 0. The only subset of $\mathbb Z_1^n$ of cardinality 0 is the empty one, so in this case the sum on the right-hand side of (3) is empty and (3) therefore reduces to the statement that

$\displaystyle f \left( \bigcup_{k = 1}^n A_k \right) \ge 0,$

which is true because f is non-negative- or ∞-valued.

Now, suppose $M \in \mathbb Z_0$ and Bonferroni’s inequalities hold in the case where m = M, and consider the case where m = M + 1. We shall use another inductive proof, within the inductive proof we’re currently carrying out, to show that Bonferroni’s inequalities hold for every $n \in \mathbb Z_0$ in when m has this particular value. In the case where n = 0, the left-hand side of (3) reduces to f(∅) and the right-hand side is the empty sum once again, because there are no subsets of $\mathbb Z_1^0$ ($\mathbb Z_1^0$ is the set of the integers greater than or equal to 1 and less than or equal to 0, and there are of course no such integers). Because f(∅) = 0 it follows that (3) is true, regardless of the direction of the inequality required.

As for the successor case, suppose that $N \in \mathbb Z_0$ and Bonferroni’s inequalities hold in the case where n = N. Consider the case where n = N + 1. It is helpful at this point to write (3) for the given values of m and n as

$\displaystyle (-1)^{M + 1} (f(A) - a) \ge 0,$

where

\displaystyle \begin{aligned} A &= \bigcup_{k = 1}^{N + 1} A_k, \\ a &= \sum_{\substack{S \subseteq \mathbb Z_1^{N + 1} \\ 0 < \#S \le M + 1}} (-1)^{\#S + 1} f \left( \bigcap_{k \in S} A_k \right). \end{aligned}

If we also let

\displaystyle \begin{aligned} B &= \bigcup_{k = 1}^N A_k, \\ b &= \sum_{\substack{S \subseteq \mathbb Z_1^N \\ 0 < \#S \le M + 1}} (-1)^{\#S + 1} f \left( \bigcap_{k \in S} A_k \right), \end{aligned}

then we have (−1)M + 1(f(B) − b) ≥ 0 because Bonferroni’s inequalities hold in the case where n = N. Now, how are f(A) and a related to f(B) and b?

We obviously have A = BAN + 1, but B and AN + 1 are not necessarily disjoint so this doesn’t immediately tell us anything about the relationship of f(A) and f(B). However, AN + 1 is certainly disjoint from B \ AN + 1, so we have

$(4) \quad \displaystyle f(A) = f(B \setminus A_{N + 1}) + f(A_{N + 1}).$

That’s about all we can usefully say for the moment. As for a and b, well, the terms of b are a subset of those of a so it’s quite easy to write down the difference ab. If we manipulate that difference a little bit, we can start getting it to look like something that could occur on the right-hand side of (3).

\displaystyle \begin{aligned} a - b &= \sum_{\substack{S \subseteq \mathbb Z_1^{N + 1} \\ N + 1 \in S \\ \#S \le M + 1}} (-1)^{\#S + 1} f \left( \bigcap_{k \in S} A_k \right) \\ &= \sum_{\substack{S \subseteq \mathbb Z_1^N \\ \#S \le M}} (-1)^{\#S + 2} f \left( \left( \bigcap_{k \in S} A_k \right) \cap A_{N + 1} \right) \\ &= f(A_{N + 1}) - \sum_{\substack{S \subseteq \mathbb Z_1^N \\ 0 < \#S \le M}} (-1)^{\#S + 1} f \left( \bigcap_{k \in S} A_k \cap A_{N + 1} \right) \end{aligned}

Let

\displaystyle \begin{aligned} C &= \bigcup_{k = 1}^N A_k \cap A_{N + 1}, \\ c &= \sum_{\substack{S \subseteq \mathbb Z_1^N \\ 0 < \#S \le M}} (-1)^{\#S + 1} f \left( \bigcap_{k \in S} A_k \cap A_{N + 1} \right). \end{aligned}

Then we have (−1)M(f(C) − c) ≥ 0, because Bonferroni’s inequalities hold in the case where m = M, and we have ab = f(AN + 1) − c.

Now, if we use the distributivity of intersection over union we can rewrite C as BAN + 1. It follows that B is the disjoint union of C and the set B \ AN + 1 which turned up above when we were contemplating the relationship of f(A) and f(B), and therefore we have f(B) = f(C) + f(B \ AN + 1). Using this new equation we may rewrite (4) as

$(4) \quad \displaystyle f(A) = f(B) - f(C) + f(A_{N + 1}),$

from which it follows that f(A) − f(B) = f(AN + 1) − f(C)—a nicely analogous equation to ab = f(AN + 1) − c. Finally, let us add the two quantities (−1)M + 1(f(B) − b) and (−1)M(f(C) − c) ≥ 0 which we know to be non-negative. The sum of two non-negative quantities is non-negative also, so we have

\displaystyle \begin{aligned} 0 &\le (-1)^{M + 1} (f(B) - b) + (-1)^M (f(C) - c) \\ &= (-1)^{M + 1} (f(B) - b - (f(C) - c)) \\ &= (-1)^{M + 1} (f(B) - f(C) - (b - c)) \\ &= (-1)^{M + 1} (f(A) - f(A_{N + 1}) - (a - f(A_{N + 1}))) \\ &= (-1)^{M + 1} (f(A) - a). \end{aligned}

$\blacksquare$

[1] For every $(a, b) \in \mathbb Z^{\times 2}$, I denote $\{n \in \mathbb Z : a \le n \le b\}$, $\{n \in \mathbb Z : a \le n\}$ and $\{n \in \mathbb Z : n \le b\}$ by $\mathbb Z_a^b$, $\mathbb Z_a$ and $\mathbb Z^b$ respectively.

## A very simple stochastic model of diachronic change

1. The discrete process

1.1. The problem

Consider an entity (for example, a language) which may or may not have a particular property (for example, obligatory coding of grammatical number). For convenience and interpretation-neutrality, we shall say that the entity is positive if it has this property and negative if it does not have this property. Consider the entity as it changes over the course of a number of events (for example, transmissions of the language from one generation to another) in which the entity’s state (whether it is positive or negative) may or may not change. For every nonnegative integer ${n}$, let ${X_n}$ represent the entity’s state after exactly ${n}$ events have occurred, with negativity being represented by 0 and positivity being represented by 1. The initial state ${X_0}$ is a constant parameter of the model, but the states at other times are random variable whose “success” probabilities (i.e. values of 1 under their probability mass functions) are determined by ${X_0}$ and the other parameters of the model.

The other parameters of the model, besides ${X_0}$, are denoted by ${p}$ and ${q}$. These represent the probabilities that an event will change the state from negative to positive or from positive to negative, respectively. They are assumed to be constant across events—this assumption can be thought of as an interpretation of the uniformitarian principle familiar from historical linguistics and other fields. I shall call a change of state from negative to positive a gain and a change of state from positive to negative a loss, so that ${p}$ can be thought of as the gain rate per event and ${q}$ can be thought of as the loss rate per event.

Note that the gain resp. loss probability is ${p}$/${q}$ only if the state is negative resp. positive as the event begins. If the state is already positive resp. negative as the event begins then it is impossible for a further gain resp. loss to occur and therefore the gain resp. loss probability is 0 (but the loss resp. gain probability is ${q}$/${p}$). Thus the random variables ${X_1}$, ${X_2}$, ${X_3}$, … are not necessarily independent of one another.

I am aware that there’s a name for a sequence of random variables that are not necessarily independent of one another, namely “stochastic process”. However, that is about the extent of what I know about stochastic processes. I think the thing I’m talking about in this post is a very simple example of a stochastic process–an appropriate name for it would be the gain-loss process. If you know something about stochastic processes it might seem very trivial, but it was an interesting problem for me to try to figure out knowing nothing already about stochastic processes.

1.2. The solution

Suppose ${n}$ is a nonnegative integer and consider the state ${X_{n + 1}}$ after exactly ${n + 1}$ events have occurred. If the entity is negative as the ${(n + 1)}$th event begins, the probability of gain during the ${(n + 1)}$th event is ${p}$. If the entity is positive as the ${(n + 1)}$th event begins, the probability of loss during the ${(n + 1)}$th event is ${q}$. Now, as the ${(n + 1)}$th event begins, exactly ${n}$ events have already occurred. Therefore the probability that the entity is negative as the ${(n + 1)}$th event begins is ${\mathrm P(X_n = 0)}$ and the probability that the entity is positive as the ${(n + 1)}$th event begins is ${\mathrm P(X_n = 1)}$. It follows by the law of total probability that

\displaystyle \begin{aligned} \mathrm P(X_{n + 1} = 1) &= p (1 - \mathrm P(X_n = 1)) + (1 - q) \mathrm P(X_n = 1) \\ &= p - p \mathrm P(X_n = 1) + \mathrm P(X_n = 1) - q \mathrm P(X_n = 1) \\ &= p - (p - 1 + q) \mathrm P(X_n = 1) \\ &= p + (1 - p - q) \mathrm P(X_n = 1). \end{aligned}

This recurrence relation can be solved using the highly sophisticated method of “use it to find general equations for the first few terms in the sequence, extrapolate the pattern, and confirm that the extrapolation is valid using a proof by induction”. I’ll spare you the laborious first phrase, and just show you the second and third. The solution is

\displaystyle \begin{aligned} \mathrm P(X_n = 1 | X_0 = 0) &= p \sum_{i = 0}^{n - 1} (1 - p - q)^i, \\ \mathrm P(X_n = 1 | X_0 = 1) &= 1 - q \sum_{i = 0}^{n - 1} (1 - p - q)^i. \end{aligned}

Just so you can check that this is correct, the proofs by induction for the separate cases are given below.

Case 1 (${X_0 = 0)}$. Base case. The expression

$\displaystyle p \sum_{i = 0}^{n - 1} (1 - p - q)^i$

evaluates to 0 if ${n = 0}$, because the sum is empty.

Successor case. For every nonnegative integer ${n}$ such that

$\displaystyle \mathrm P(X_n = 1 | X_0 = 0) = p \sum_{i = 0}^{n - 1} (1 - p - q)^i,$

we have

\displaystyle \begin{aligned} \mathrm P(X_{n + 1} = 1 | X_0 = 0) &= p + (1 - p + q) \mathrm P(X_n = 1 | X_0 = 0) \\ &= p + (1 - p - q) p \sum_{i = 0}^{n - 1} (1 - p - q)^i \\ &= p + p (1 - p - q) \sum_{i = 0}^{n - 1} (1 - p - q)^i \\ &= p \left( 1 + \sum_{i = 0}^{n - 1} (1 - p - q)^{i + 1} \right) \\ &= p \sum_{j = 0}^n (1 - p - q)^j. \end{aligned}

Case 2 (${X_0 = 1}$). Base case. The expression

$\displaystyle 1 - q \sum_{i = 0}^{n - 1} (1 - p - q)^i$

evaluates to 1 if ${n = 0}$, because the sum is empty.

Successor case. For every nonnegative integer ${n}$ such that

$\displaystyle \mathrm P(X_n = 1 | X_0 = 1) = 1 - q \sum_{i = 0}^{n - 1} (1 - p - q)^i,$

we have

\displaystyle \begin{aligned} \mathrm P(X_{n + 1} = 1 | X_0 = 1) &= p + (1 - p + q) \mathrm P(X_n = 1 | X_0 = 1) \\ &= p + (1 - p - q) \left( 1 - q \sum_{i = 0}^{n - 1} (1 - p - q)^i \right) \\ &= p + 1 - p - q - (1 - p - q) q \sum_{i = 0}^{n - 1} (1 - p - q)^i \\ &= 1 - q - q (1 - p - q) \sum_{i = 0}^{n - 1} (1 - p - q)^i \\ &= 1 - q \left( 1 + \sum_{i = 0}^{n - 1} (1 - p - q)^{i + 1} \right) \\ &= 1 - q \sum_{j = 0}^n (1 - p - q)^j. \end{aligned}

I don’t know if there is any way to make sense of why exactly these equations are the way they are; if you have any ideas, I’d be interested to hear your comments. There is a nice way I can see of understanding the difference between the two cases. Consider an additional gain-loss process ${B}$ which changes in tandem with the gain-loss process ${A}$ that we’ve been considering up till just now, so that its state is always the opposite of that of ${A}$. Then the gain rate of ${B}$ is ${q}$ (because if ${B}$ gains, ${A}$ loses) and the lose rate of ${B}$ is ${p}$ (because if ${B}$ loses, ${A}$ gains). And for every nonnegative integer ${n}$, if we let ${Y_n}$ denote the state of ${B}$ after exactly ${n}$ events have occurred, then

$\displaystyle \mathrm P(Y_n = 1) = 1 - \mathrm P(X_n = 1)$

because ${Y_n = 1}$ if and only if ${X_n = 0}$. Of course, we can also rearrange this equation as ${\mathrm P(X_n = 1) = 1 - \mathrm P(Y_n = 1)}$.

Now, we can use the equation for Case 1 above, but with the appropriate variable names for ${B}$ substituted in, to see that

$\displaystyle \mathrm P(Y_n = 1 | Y_0 = 0) = q \sum_{i = 0}^{n - 1} (1 - q - p)^i,$

and it then follows that

$\displaystyle \mathrm P(X_n = 1 | X_0 = 1) = 1 - q \sum_{i = 0}^{n - 1} (1 - p - q)^i.$

Anyway, you may have noticed that the sum

$\displaystyle \sum_{i = 0}^{n - 1} (1 - p - q)^i$

which appears in both of the equations for ${\mathrm P(X_n = 1)}$ is a geometric progression whose common ratio is ${1 - p - q}$. If ${1 - p - q = 1}$, then ${p + q = 0}$ and therefore ${p = q = 0}$ (because ${p}$ and ${q}$ are probabilities, and therefore non-negative). The probability ${\mathrm P(X_n = 1)}$ is then simply constant at 0 if ${X_0 = 0}$ (because gain is impossible) and constant at 1 if ${X_0 = 1}$ (because loss is impossible). Outside of this very trivial case, we have ${1 - p - q \ne 1}$, and therefore the geometric progression may be written as a fraction as per the well-known formula:

\displaystyle \begin{aligned} \sum_{i = 0}^{n - 1} (1 - p - q)^i &= \frac {1 - (1 - p - q)^n} {1 - (1 - p - q)} \\ &= \frac {1 - (1 - p - q)^n} {p + q}. \end{aligned}

It follows that

\displaystyle \begin{aligned} \mathrm P(X_n = 1 | X_0 = 0) &= \frac {p (1 - (1 - p - q)^n)} {p + q}, \\ \mathrm P(X_n = 1 | X_0 = 1) &= 1 - \frac {q (1 - (1 - p - q)^n)} {p + q} \\ &= \frac {p + q - q (1 - (1 - p - q)^n)} {p + q} \\ &= \frac {p + q - q + q (1 - p - q)^n} {p + q} \\ &= \frac {p + q (1 - p - q)^n} {p + q}. \end{aligned}

From these equations it is easy to see the limiting behaviour of the gain-loss process as the number of events approaches ${\infty}$. If ${1 - p - q = -1}$, then ${p + q = 2}$ and therefore ${p = q = 1}$ (because ${p}$ and ${q}$ are probabilities, and therefore not greater than 1). The equations in this case reduce to

\displaystyle \begin{aligned} \mathrm P(X_n = 1 | X_0 = 0) &= \frac {1 - (-1)^n} 2, \\ \mathrm P(X_n = 1 | X_0 = 1) &= \frac {1 + (-1)^n} 2, \end{aligned}

which show that the state simply alternates deterministically back and forth between positive and negative (because ${(1 - (-1)^n)/2}$ is 0 if ${n}$ is even and 1 if ${n}$ is odd and ${(1 + (-1)^n)/2}$ is 1 if ${n}$ is even and 0 if ${n}$ is odd).

Otherwise, we have ${|1 - p - q| < 1}$ and therefore

$\displaystyle \lim_{n \rightarrow \infty} (1 - p - q)^n = 0.$

Now the equations for ${\mathrm P(X_n = 1 | X_0 = 0)}$ and ${\mathrm P(X_n = 1 | X_0 = 1)}$ above are the same apart from the term in the numerator which contains ${(1 - p - q)^n}$ as a factor, as well as another factor which is independent of ${n}$. Therefore, regardless of the value of ${X_0}$,

$\displaystyle \lim_{k \rightarrow \infty} \mathrm P(X_k = 1) = \frac p {p + q}.$

This is a nice result: if ${n}$ is sufficiently large, the dependence of ${X_n}$ on ${X_0}$, ${X_1}$, … and ${X_{n - 1}}$ is negligible and its success probability is negligibly different from ${p/(p + q)}$. That it is this exact quantity sort of makes sense: it’s the ratio of the gain rate to the theoretical rate of change of state in either direction that we would get if both a gain and loss could occur in a single event.

In case you like graphs, here’s a graph of the process with ${X_0 = 0}$, ${p = 1/100}$, ${q = 1/50}$ and 500 events. The x-axis is the number of events that have occurred and the y-axis is the observed frequency, divided by 1000, of the state being positive after this number of events has occurred (for the blue line) or the probability of the state being positive according to the equations described in this post (for the green line). If you want to, you can view the Python code that I used to generate this graph (which is actually capable of simulating multiple-trait interactions, although I haven’t tried solving it in that case) on GitHub.

2. The continuous process

2.1. The problem

Let us now consider the same process, but continuous rather than discrete. That is, rather than the gains and losses occuring over the course of a discrete sequence of events, we now have a continuous interval in time, during which at any point losses and gains might occur instantaneously. The state of the process at time ${t}$ shall be denoted ${X(t)}$. Although multiple gains and losses may occur during an arbitrary subinterval, we may assume for the purpose of approximation that during sufficiently short subintervals only one gain or loss, or none, may occur, and the probabilities of gain and loss are directly proportional to the length of the subinterval. Let ${\lambda}$ be the constant of proportionality for gain and let ${\mu}$ be the constant of proportionality for loss. These are the continuous model’s analogues of the ${p}$ and ${q}$ parameters in the discrete model. Note that they may be greater than 1, unlike ${p}$ and ${q}$.

2.2. The solution

Suppose ${t}$ is a non-negative real number and ${n}$ is a positive integer. Let ${\Delta t = 1/n}$. The interval in time from time 0 to time ${t}$ can be divided up into ${n}$ subintervals of length ${\Delta t}$. If ${\Delta t}$ is small enough, so that the approximating assumptions described in the previous paragraph can be made, then the subintervals can be regarded as discrete events, during each of which gain occurs with probability ${\lambda \Delta t}$ if the state at the start point of the subinterval is negative and loss occurs with probability ${\mu \Delta t}$ if the state at the start point of the subinterval is positive. For every positive integer ${k}$ between 0 and ${n}$ inclusive, let ${Y_k}$ denote the state of this discrete approximation of the process at time ${t + k \Delta t}$. Then for every integer ${k}$ between 0 and ${n}$ (inclusive) we have

\displaystyle \begin{aligned} \mathrm P(Y_k = 1 | Y_0 = 0) &= \frac {\lambda \Delta t (1 - (1 - \lambda \Delta t - \mu \Delta t)^k)} {\lambda \Delta t + \mu \Delta t}, \\ \mathrm P(Y_k = 1 | Y_0 = 1) &= \frac {\lambda \Delta t + \mu \Delta t (1 - \lambda \Delta t - \mu \Delta t)^k} {\lambda \Delta t + \mu \Delta t}, \end{aligned}

provided ${\lambda}$ and ${\mu}$ are not both equal to 0 (in which case, just as in the discrete case, the state remains constant at whatever the initial state was).

Many of the ${\Delta t}$ factors in this equation can be cancelled out, giving us

\displaystyle \begin{aligned} \mathrm P(Y_k = 1 | Y_0 = 0) &= \frac {\lambda (1 - (1 - (\lambda + \mu) \Delta t)^k)} {\lambda + \mu}, \\ \mathrm P(Y_k = 1 | Y_0 = 1) &= \frac {\lambda + \mu (1 - (\lambda + \mu) \Delta t)^k} {\lambda + \mu}. \end{aligned}

Now consider the case where ${k = n}$ in the limit ${n}$ approaches ${\infty}$. Note that ${\Delta t}$ approaches 0 at the same time, because ${\Delta t = t/n}$, and therefore the limit of ${(1 - (\lambda + \mu) \Delta t)^n}$ is not simply 0 as in the discrete case. If we rewrite the expression as

$\displaystyle \left( 1 - \frac {t (\lambda + \mu)} n \right)^n$

and make the substitution ${n = -mt(\lambda + \mu)}$, giving us

$\displaystyle \left( 1 + \frac 1 m \right)^{-mt(\lambda + \mu)} = \left( \left( 1 + \frac 1 m \right)^m \right)^{-t(\lambda + \mu)},$

then we see that the limit is in fact ${e^{-t(\lambda + \mu)}}$, an exponential function of ${t}$. It follows that

\displaystyle \begin{aligned} \mathrm P(X(t) = 1 | X(0) = 0) = \lim_{n \rightarrow \infty} \mathrm P(Y_n = 1 | Y_0 = 0) &= \frac {\lambda (1 - e^{-t(\lambda + \mu)})} {\lambda + \mu}, \\ \mathrm P(X(t) = 1 | X(0) = 1) = \lim_{n \rightarrow \infty} \mathrm P(Y_n = 1 | Y_0 = 1) &= \frac {\lambda + \mu e^{-t(\lambda + \mu)}} {\lambda + \mu}. \end{aligned}

This is a pretty interesting result. I initially thought that the continuous process would just have the solution ${\mathrm P(X_n = 1) = \lambda/{\lambda + \mu}}$, completely independent of ${X_0}$ and ${t}$, based on the idea that it could be viewed as a discrete process with an infinitely large number of events within every interval of time, so that it would constantly behave like the discrete process does in the limit as the number of events approaches infinity. In fact it turns out that it still behaves like the discrete process, with the effect of the initial state never quite disappearing—although it does of course disappear in the limit as ${t}$ approaches ${\infty}$, because ${e^{-t(\lambda + \mu)}}$ approaches 0:

$\displaystyle \lim_{t \rightarrow \infty} \mathrm P(X(t) = 1) = \frac {\lambda} {\lambda + \mu}.$

## Some thoughts on generating random members of infinite sets

Here’s a challenge for you. Think of a random natural number (i.e. one of the numbers in the sequence 0, 1, 2, …). If you think about this, this is a rather difficult task, because of the fact that there are infinitely many natural numbers, but if the choice is random, then every natural number should have a non-zero chance of being chosen. In fact, I can give you an argument that it is impossible. Assuming there is a positive lower bound on the length of time it takes to think of a natural number, then, since there are only finitely many minds in the universe, there are only finitely many natural numbers which have ever been thought of. And the random natural number you thought of must have been one of these natural numbers. So there are infinitely many natural numbers—the ones which have never been thought of—which you were never going to choose. Their chance of being chosen was zero, so the choice was not truly random. In fact, it was non-random in a pretty serious way: not just one or two natural numbers had zero chance of being chosen, but infinitely many natural numbers had zero chance of being chosen.

It is possible to generate a random natural number using an algorithm, provided you have some way of generating random natural numbers up to some upper bound. For example, here’s one algorithm you could use.

1. Let n be 0.
2. Generate a random natural number which is either 0 or 1, and let x be this number.
3. If x is 0, stop and return $n$. Otherwise, add 1 to n and repeat from step 2.

For every natural number n, this algorithm will generate n if and only if 1 is generated in step 2 n times, so the probability that it will generate n is 2−(n + 1) (since the probability that 1 will be generated in step 2 is 1/2).

Of course, you could ‘think of’ a natural number by performing this algorithm in your head, which kind of invalidates the argument above. We can rescue the argument by noting that, if you perform the algorithm in your head, you will need time to perform each step. Therefore, if you are required to think of a natural number within a given length of time, you will only have time to repeat from step 2 a finite number of times. Let’s call this number of times n. Then there will not be enough time for you to generate n + 1, or any greater natural number, using the algorithm.

Now, although this algorithm can generate every natural number, it doesn’t generate them all with the same probability. 0 is twice as likely to be generated as 1, which in turn is twice as likely to be generated as 2, and so on. In fact, it is not actually possible to devise an algorithm that generates every natural number with the same probability. Suppose you have an algorithm and for every natural number n, the probability that the algorithm will generate n upon execution is Pn. Then the infinite sum Σn ∈ ℕ Pn has to be 1, because the algorithm always generates some natural number. But if the value of Pn does not depend on the value of n, this is impossible: the sum evaluates as 0 if the value of Pn is always 0, and the sum evaluates as ∞ if the value of Pn is always a real number greater than 0. On the other hand, we do have that Σn ∈ ℕ 2−(n + 1) = 1, so that’s why the algorithm above does work.

## Explaining the Monty Hall problem

Yesterday, after reading this interesting article, I think I finally gained a proper understanding of the Monty Hall problem. Although it’s a very famous problem, in case you haven’t seen it before, I’ll state the problem below. If it is new to you, before reading on, you can have a think about what the answer might be.

So, suppose you’re on a game show where there are three doors. One of the doors holds a prize behind it, the others don’t hold anything. Your goal is simply to open the door with the prize behind it. The interesting part is, when you pick a door, you don’t open it right away: instead your host (Monty Hall) opens another door, different from the one you picked. Monty knows which door has the prize behind it, and he never opens that door. So then there are only two unopened doors remaining, one of which you picked, and one of which has the prize behind it. Monty asks you if you’d like to pick the other door instead. If you do so, are you any more likely to open the door with the prize behind it?

I will state the answer in the next paragraph.

When you first pick from the three doors, you know nothing about which door has the prize behind it. All the doors are equally likely to have the prize behind them; the probability that you have the right door is, therefore, 1/3. This should be pretty obvious. What many people don’t realise is that after Monty opens one of the doors, you do know something about which door has the prize behind it. Let’s call the door you picked originally door 1, the door Monty opened door 3, and the remaining door door 2. Door 3 does not have the prize behind it, and in fact, Monty would never have picked it if it did have the prize behind it. Therefore, the fact that Monty didn’t open door 2 is evidence in favour of it having the prize behind it. On the other hand, Monty was never considering opening door 1, because he was never going to open the door you picked. So you have evidence in favour of door 2 having the prize behind it, and no evidence in favour of door 1 having the prize behind it. Therefore, you should switch to door 2. You might also wonder how much more likely are you to get the prize. Well, the probability that the prize is not behind door 1 is 2/3. And since Monty opened door 3, if the prize is not behind door 1, it must be behind door 2. So the probability that the prize is behind door 2 is 2/3.

This problem became well-known after a reader posed it to Marilyn vos Savant in a letter to Parade magazine in 1990. She gave the correct answer, but later received hundreds of letters from readers saying she was wrong (some of them rather rudely), and that it isn’t helpful to switch. The crucial thing which these readers failed to see is that Monty is giving you extra information by opening one of the doors. If you didn’t know that Monty would only open a door which you hadn’t picked and which didn’t have the prize behind it, then you would not have this extra information, and as far as you knew, switching would not increase your probability of being correct.