Generating random probability distributions

I’ve been thinking about how to use a computer program to randomly generate a probability distribution of a given finite size. This has turned out to be an interesting problem.

My first idea was to generate n − 1 uniform variates in [0, 1] (where n is the desired size), sort them, add 0 to the front of the list and 1 to the back of the list, and then take the non-negative differences between the adjacent variates. In Python code:

def randpd1(size):
variates = [random.random() for i in range(size - 1)]
variates.sort()
return [j - i for i, j in zip([0] + variates, variates + [1])]

My second idea was to simply generate n uniform variates in [0, 1], add them together, and take the ratio of each individual variate to the sum.

def randpd2(size):
variates = [random.random() for i in range(size)]
s = sum(variates)
return [i/s for i in variates]

Both of these functions do reliably generate probability distributions, i.e. lists of non-negative real numbers (encoded as Python float objects) that sum to 1, although very large lists generated by randpd2 sometimes sum to something slightly different from 1 due to floating-point imprecision.

>>> sample1 = [randpd1(2**(i//1000 + 1)) for i in range(10000)]
>>> all(all(p >= 0 for p in pd) for pd in sample1)
True
>>> all(sum(pd) == 1 for pd in sample1)
True
>>> sample2 = [randpd2(2**(i//1000 + 1)) for i in range(10000)]
>>> all(all(p >= 0 for p in pd) for pd in sample2)
True
>>> all(sum(pd) == 1 for pd in sample2)
False
>>> all(math.isclose(sum(pd), 1) for pd in sample2)
True

But do they both generate a random probability distribution? In precise terms: for a given size argument, are the probability distributions of the return values randpd1(size) and randpd2(size) always uniform?

I don’t really know how to answer this question. In fact, it’s not even clear to me that there is a uniform distribution over the probability distributions of size n for every positive integer n. The problem is that the probability distributions of size n are the solutions in $[0, 1]^n$ of the equation $p_1 + p_2 + \dotsb + p_n = 1$, where $p_1$, $p_2$, … and $p_n$ are dummy variables, and therefore they comprise a set S whose dimension is n − 1 (not n). Because S is missing a dimension, continuous probability distributions over it cannot be defined in the usual way via probability density mappings on $\mathbb R^n$. Any such mapping would have to assign probability density 0 to every point in S, because for every such point x, there’s a whole additional dimension’s worth of points in every neighbourhood of x which are not in S. But then the integral of the probability density mapping over S would be 0, not 1, and it would not be a probability density mapping.

But perhaps you can map S onto a subset of $\mathbb R^{n - 1}$, and do something with a uniform distribution over the image. In any case, I’m finding thinking about this very confusing, so I’ll leave it for readers to ponder over. Given that I don’t currently know what a uniform probability distribution over the probability distributions of size n even looks like, I don’t know how to test whether one exists.

I can look at the marginal distributions of the individual items in the returned values of randpd1 and randpd2. But these marginal distributions are not straightforwardly related to the joint distribution of the list as a whole. In particular, uniformity of the joint distribution does not imply uniformity of the marginal distributions, and uniformity of the marginal distributions does not imply uniformity of the joint distribution.

But it’s still interesting to look at the marginal distributions. First of all, they allow validation of another desirable property of the two functions: the marginal distributions are the same for each item (regardless of its position in the list). I’m not going to demonstrate this here because it would be tedious, but it does look like this is the case. Therefore we can speak of “the marginal distribution” without reference to any particular item. Second, they reveal that randpd1 and randpd2 do not do exactly the same thing. The marginal distributions are different for the two functions. Let’s first look just at the case where size is 2.

>>> data1 = [randpd1(2)[0] for i in range(100000)]
>>> plt.hist(data1)

>>> data2 = [randpd2(2)[0] for i in range(100000)]
>>> plt.hist(data2)

The first plot looks like it’s been generated from a uniform distribution over [0, 1]; the second plot looks like it’s been generated from a non-uniform distribution which concentrates the probability density at $1/2$. It’s easy to see why the distribution is uniform for randpd1: the function works by generating a single uniform variate p and then returning [min(p, 1 - p), max(p, 1 - p)], and given that the distribution of p is uniform the distribution of 1 - p is also uniform. The function randpd2, on the other hand, works by generating two uniform variates p + q and returning [p/(p + q), q/(p + q)]. However, I don’t know what the distribution of p/(p + q) and q/(p + q) is exactly, given that p and q are uniformly distributed. This is another thing I hope readers who know more about probability and statistics than me might be able to enlighten me on.

Here are the graphs for size 3:

>>> data1 = [randpd1(3)[0] for i in range(100000)]
>>> plt.hist(data1)

>>> data2 = [randpd2(3)[0] for i in range(100000)]
>>> plt.hist(data2)

The marginal distribution for randpd1 is no longer uniform; rather, it’s right triangular, with the right angle at the point (0, 0) and height 2. That means that, roughly speaking, a given item in the list returned by randpd1(3) is twice as likely to be close to 0 as it is to be close to $1/2$, and half as likely to be close to 1 as it is to be close to $1/2$.

In general, the marginal distribution for randpd1 is the distribution of the minimum of a sample of uniform variates in [0, 1] of size n − 1, where n is the value of size. This is because randpd1 works by generating such a sample, and the minimum of that sample always ends up being the first item in the returned list, and the marginal distributions of the other items are the same as the marginal distribution of the first item.

It turns out to be not too difficult to derive an exact formula for this distribution. For every $x \in [0, 1]$, the minimum is greater than x if and only if all n − 1 variates are greater than x. Therefore the probabilities of these two events are the same. The probability of an individual variate being greater than x is 1 − x (because, given that the variate is uniformly distributed, x is the probability that the variate is less than or equal to x) and therefore, given that the variates are independent of each other, the probability of all being greater than x is $(1 - x)^{n - 1}$. It follows that the probability of the minimum being less than or equal to x is $1 - (1 - x)^{n - 1}$. That is, the cumulative distribution mapping (CDM) f of the marginal distribution for randpd1 is given by

$\displaystyle f(x) = 1 - (1 - x)^{n - 1}.$

The probability distribution defined by this CDM is a well-known one called the beta distribution with parameters (1, n − 1). That’s a nice result!

The marginal distribution for randpd2, on the other hand, is similar to the one for size 2 except that the mean is now something like $1/3$ rather than $1/2$, and because the support is still the whole interval [0, 1] this results in a left-skewing of the distribution. Again, I don’t know how to characterize this distribution exactly. Here are the graphs for sizes 4 and 5:

>>> data = [randpd2(4)[0] for i in range(100000)]
>>> plt.hist(data)

>>> data = [randpd2(5)[0] for i in range(100000)]
>>> plt.hist(data2)

It looks like the marginal distribution generally has mean $1/n$, or something close to that, for every positive integer n, while still having density approaching 0 at the left limit.

In conclusion… this post doesn’t have a conclusion, it just has a bunch of unanswered questions which I’d like to know the answers to.

1. Is the concept of a uniform distribution over the probability distributions of size n sensical?
2. If so, do the returned values of randpd1 and randpd2 have that distribution?
3. If not, what distributions do they have?
4. What’s the exact form of the marginal distribution for randpd2?
5. Which is better: randpd1 or randpd2? Or if one isn’t clearly better than the other, what is each one best suited to being used for?
6. Are there any other algorithms one could use to generate a random probability distribution?