## Dualities between depth-first search and breadth-first search

Something which I think is fairly well-known among programmers (I first learned it from reading Higher Order Perl) is that depth-first search and breadth-first search can be implemented in such a way that they differ only in their choice of data structure—depth-first search uses a stack (where you add to and remove from the same end), breadth-first search uses a queue (where you add to one end and remove from the other). Here is some illustrative Python code:

from collections import deque
from collections.abc import Iterable, Iterator
from typing import Protocol, TypeVar

T = TypeVar('T')

class Tree(Protocol[T]):
def __call__(self, vertex: T) -> Iterable[T]:
"""Yield each of the vertex's children."""

def dfs(tree: Tree[T], start: T) -> Iterator[T]:
stack = deque([start])

while stack:
vertex = stack.pop()
yield vertex
stack.extend(tree(vertex))

def bfs(tree: Tree[T], start: T) -> Iterator[T]:
queue = deque([start])

while queue:
vertex = queue.pop()
yield vertex
queue.extendleft(tree(vertex))


Technical notes:

• The Tree class is just an interface to say that a tree is a function (or another callable object) that, given something (a vertex), returns some other things of the same type (the children of the vertex).
• Ordinary Python lists only support adding and removing elements efficiently (i.e. in constant time) at the back, because adding or removing elements at the front requires all the rest of the elements in the list to shift up or down by one place. They are therefore suitable for use as stacks, but not as queues. Instead I’ve used “deques” from the collections.deque module, which do support efficient adding and removing at both ends and hence make good queues as well as good stacks. Using an ordinary list for dfs would have been perfectly fine, but I’ve used deques for both functions just to makes the code more symmetric.

It’s worth noting however that there is a slight difference between how these two functions work, in that dfs yields sibling vertices in reverse order, compared to bfs (or to the order within the tree itself): for example, list(dfs(lambda _: (1, 2), 0)) evaluates to [0, 2, 1] while list(bfs(lambda _: (1, 2), 0)) evaluates to [0, 1, 2]. This isn’t an inherent consequence of the depth-first-ness or breadth-first-ness of the search order; it’s just due to the fact that with a stack, you take things out in the reverse order compared to how you put them in.

One way to rectify this while preserving the stack/queue duality is to keep vertices themselves out of the stack/queue. Instead, we can directly store the iterators over child vertices that we get from the tree:

def dfs(tree: Tree[T], start: T) -> Iterator[T]:
stack = deque([iter([start])])

while stack:
try:
vertex = next(stack[-1])
except StopIteration:
del stack[-1]
else:
yield vertex
stack.append(iter(tree(vertex)))

def bfs(tree: Tree[T], start: T) -> Iterator[T]:
queue = deque([iter([start])])

while queue:
try:
vertex = next(queue[-1])
except StopIteration:
del queue[-1]
else:
yield vertex
queue.appendleft(iter(tree(vertex)))


For dfs, this approach is likely to improve memory usage as well, since it means that only one iterator needs to be stored for each ancestor of the vertex currently being searched, while with the old dfs function, each younger sibling of each ancestor needs to be stored (and while it is possible for there to be no younger siblings, in which case the old implementation saves more memory, it is more likely that there will be many younger siblings in practice).

For bfs, on the other hand, there is actually only a tiny improvement in memory usage: it will still use the same amount of memory, it will just take a bit longer to get to that point. To be more precise, note that we can divide a tree into “levels”, where the first level consists of the root alone, the second level consists of the root’s children, the third level consists of the root’s grandchildren (i.e. all the children of vertices on the first level), etc. The distinctive characteristic of a breadth-first search is that it always completes one level before it moves on to the next. We can call vertices on the same level “cousins”. Also, given a bunch of cousin vertices, which we can think of as a “part” of the level they’re on, we can say that the part of the previous level consisting of the parents of those cousins is immediately “above”, and the part of the next level consisting of the children of those cousins is immediately “below”.

Now, suppose we’re doing a breadth-first search, and a vertex v at level n is currently being searched. The queue will contain the remaining vertices on level n (i.e. v‘s younger cousins), and the vertices on level n + 1 that have already been added from searching older cousins of v (i.e. the children of those older cousins). In the original bfs function these vertices are actually there in the queue, so the size S of the queue is the size of the remaining part of level n (= the number of younger cousins v has), plus the size of the part of level n + 1 that’s below the reached part of level n (= the total number of children v‘s older cousins have). In iterator-based bfs, on the other hand, the vertices are packaged into iterators according to their common parents; so the size of the queue is the size of the part of level n – 1 that’s above the remaining part of level n, plus the size of the reached part of level n. If the levels are increasing in size, this will generally be a smaller size than S. But at some point later on in the search, we’ll get to the children of v on level n + 1. Then, the size of the queue will be the size of the part of level n that’s above the remaining part of level n + 1, plus the size of the reached part of level n + 1. But the vertices remaining at this point are just the younger cousins of v‘s children, so the part of level n above them consists of v‘s younger cousins. Likewise the reached part of level n + 1 at this point consists of the children of v‘s older cousins. In other words the queue is the same size as it would have been for the original bfs function when searching v.

(There are better ways of doing a memory-efficient breadth-first search, which we’ll say more about shortly.)

Anyway, an alternative way of making dfs yield sibling vertices in the right order would be to write it recursively. This amounts to doing exactly the same thing as using a stack of iterators, except that rather than using an explicit stack we use the language’s call stack, and rather than explicitly working with iterators we use a for loop.

def dfs(tree: Tree[T], start: T) -> Iterator[T]:
yield start

for vertex in tree(start):
yield from dfs(tree, vertex)


This is definitely the shortest and clearest way to write a depth-first search. Its main drawback is that in a language like Python you might end up running into a recursion limit, although it would have to be a very large and deep tree to get to that point.

And although it’s not really a meaningful drawback, the connection with breadth-first search seems to be lost by writing dfs recursively like this. Obviously one can’t write a breadth-first search recursively in the same way as a straightforward translation of iterator-based bfs, since recursion necessarily uses an implicit stack rather than an implicit queue.

However, quite surprisingly, it turns out that there is still a simple alteration we can make in order to make this recursive dfs function do a breadth-first search instead. This is clearest in Haskell notation, where it amounts to reversing the order of a Kleisli composition:

import Control.Monad
data Tree a = Tree a [Tree a]

children :: Tree a -> [Tree a]
children (Tree u vs) = vs

dfs :: Tree a -> [Tree a]
dfs t = [t] ++ (children >=> dfs) t

bfs :: Tree a -> [Tree a]
bfs t = [t] ++ (bfs >=> children) t


In Python, that new bfs function would look like this:

def bfs(tree: Tree[T], start: T) -> Iterator[T]:
yield start

for vertex in bfs(tree, start):
yield from tree(vertex)


Basically we’ve exchanged the places of the call to tree and the recursive call to the search function.

While this is a very short way to write a breadth-first search, I would say it is rather more difficult to understand, compared to recursive dfs. This is mainly due to the fact that, unlike recursive dfs (which could easily be rewritten to return a sequence, rather than just an iterator) it makes essential use of the fact that it’s a generator function, so that the recursive calls in it don’t simply nest, but are rather interwoven with each other. The best way to understand how it works is to insert some logging and see it in action.

from functools import partial
from operator import getitem
import itertools as it

def bfs(tree: Tree[T], start: T) -> Iterator[T]:
call_counter = it.count()

def _bfs() -> Iterator[T]:
call_id = next(call_counter)
print(f'call ID {call_id} begun')
print(f'call ID {call_id}: yielding {start}')
yield start

for vertex in _bfs():
print(f'call ID {call_id}: yielding children of {vertex}')

for child in tree(vertex):
print(f'call ID {call_id}: yielding {child}')
yield child

yield from _bfs()

>>> tree = {1: (2, 3), 2: (4,), 3: (5, 6), 4: (7,), 5: (), 6: (8,), 7: (), 8: ()}
>>> search = bfs(partial(getitem, tree), 1)
>>> list(it.islice(search, 8))
call ID 0 begun
call ID 0: yielding 1
call ID 1 begun
call ID 1: yielding 1
call ID 0: yielding children of 1
call ID 0: yielding 2
call ID 0: yielding 3
call ID 2 begun
call ID 2: yielding 1
call ID 1: yielding children of 1
call ID 1: yielding 2
call ID 0: yielding children of 2
call ID 0: yielding 4
call ID 1: yielding 3
call ID 0: yielding children of 3
call ID 0: yielding 5
call ID 0: yielding 6
call ID 3 begun
call ID 3: yielding 1
call ID 2: yielding children of 1
call ID 2: yielding 2
call ID 1: yielding children of 2
call ID 1: yielding 4
call ID 0: yielding children of 4
call ID 0: yielding 7
call ID 2: yielding 3
call ID 1: yielding children of 3
call ID 1: yielding 5
call ID 0: yielding children of 5
call ID 1: yielding 6
call ID 0: yielding children of 6
call ID 0: yielding 8
[1, 2, 3, 4, 5, 6, 7, 8]
>>> next(search)
call ID 4 begun
call ID 4: yielding 1
call ID 3: yielding children of 1
call ID 3: yielding 2
call ID 2: yielding children of 2
call ID 2: yielding 4
call ID 1: yielding children of 4
call ID 1: yielding 7
call ID 0: yielding children of 7
call ID 3: yielding 3
call ID 2: yielding children of 3
call ID 2: yielding 5
call ID 1: yielding children of 5
call ID 2: yielding 6
call ID 1: yielding children of 6
call ID 1: yielding 8
call ID 0: yielding children of 8
call ID 5 begun
call ID 5: yielding 1
call ID 4: yielding children of 1
call ID 4: yielding 2
call ID 3: yielding children of 2
call ID 3: yielding 4
call ID 2: yielding children of 4
call ID 2: yielding 7
call ID 1: yielding children of 7
call ID 4: yielding 3
call ID 3: yielding children of 3
call ID 3: yielding 5
call ID 2: yielding children of 5
call ID 3: yielding 6
call ID 2: yielding children of 6
call ID 2: yielding 8
call ID 1: yielding children of 8
call ID 6 begun
...


Note that although it does yield the vertices in the correct, breadth-first order, it also gets itself into an infinite loop once it’s done, rather than just terminating! So in that respect, its behaviour is not quite simply that of recursive dfs plus doing the search in breadth-first order.

Let’s forget for a moment about the fact that only the yields from call ID 0 are actually sent back to the original caller, and just look at the sequence of nodes that are yielded, regardless of which recursive sub-call does the yielding. We’ll also break this sequence into segments, with a new segment starting whenever a new begins.

1
1, 2, 3
1, 2, 4, 3, 5, 6
1, 2, 4, 7, 3, 5, 6, 8

Do you see the pattern?

• In each segment, the vertices yielded are all those up to a certain level in the tree. At first we only see 1, the unique vertex at the first level; in each subsequent segment the maximum reachable level increases by 1.
• Within a segment, all vertices within the reachable levels are yielded in depth-first order. (But only the ones from the newest reachable level end up reaching the original caller, since those are the ones yielded by call ID 0).

So what this bfs function is actually doing is really something called iterative deepening depth-first search. It simply does a depth-first search, but only to a limited depth. Initially it only goes to the first level, but whenever it finishes searching, it starts again while increasing the depth limit by one. And it only yields the new vertices on the most recently-reached level. The effect of this is to yield vertices in breadth-first order, so in terms of the result it’s a breadth-first search, but in terms of the algorithm itself it’s more like a variation on depth-first search. It also now makes sense why it gets into an infinite loop eventually: after a certain point any deeper level contains no more vertices, but the algorithm will keep repeating the depth-first search and searching the new empty level.

This might seem like quite a silly way to do things, and it certainly involves a lot of redundant work compared to the alternative, since the higher levels of the tree are searched repeatedly again and again. However, it does have a big advantage in terms of memory usage, because, since it’s really a depth-first search, it has the memory usage of a depth-first search: it only needs to store the ancestors of the vertex currently being searched; in short, its memory usage corresponds to the depth of the tree. With a queue-based breadth-first search, on the other hand, the size of the queue is basically proportional to the size of the level currently being searched, i.e. the tree’s breadth. And trees in practice tend to be broader than they are deep. In fact they are often exponentially broader than they are deep, since all it takes for that to be the case is for the average number of children per vertex (the “branching factor”) to be greater than 1. In such a situation, the redundant work from searching higher levels repeatedly is less of an issue, since the deepest level reached accounts for the majority of the work in searching the tree anyway.

Iterative deepening depth-first search is known as IDDFS for short, and it’s generally associated with the field of artificial intelligence, since that’s where people are (or were) most likely to deal with problems involving brute-force search of massive trees with high branching factors, such as deciding the next move to play in a game of chess; this involves considering each of the possible moves, and for each move, the possible moves that could be made afterwards, and for each of the resulting 2-move sequences, the third moves that could be made afterwards; so it can naturally be modelled as a big tree. Although, nowadays, with the field being more focussed on machine learning, people might not be directly programming these kinds of tree searches so often any more (though I’m just guessing here—I don’t know that much about AI).

I’m not sure how well-known it is that you can write IDDFS as the simple recursive function above. You can of course write it in a more “direct” manner, and this is how it’s presented in standard references such as Korf (1985) or Russell & Norvig (2010). I learned about the connection between the recursive bfs implementation and IDDFS from a comment on an ActiveState code recipe by David Eppstein (and I learned about the recursive bfs implementation in the first place from a StackOverflow answer by Herrington Darkholme, who, it turns out, got it originally from the same code recipe as mentioned on this blog post of theirs).

For the record, here’s a more direct IDDFS implementation, including a termination check so that it stops once it reaches a level where there are no vertices:

import itertools as it

def depth_labelled(tree: Tree[T], start: T) -> tuple[Tree[tuple[T, int]], tuple[T, int]]:
def new_tree(vertex_with_depth: tuple[T, int]) -> Iterator[tuple[T, int]]:
vertex, depth = vertex_with_depth

for child in tree(vertex):
yield (child, depth + 1)

return new_tree, (start, 0)

def depth_limited(tree: Tree[T], start: T, limit: int) -> tuple[Tree[tuple[T, int]], tuple[T, int]]:
tree, start = depth_labelled(tree, start)

def new_tree(vertex_with_depth: tuple[T, int]) -> Iterator[tuple[T, int]]:
vertex, depth = vertex_with_depth

if depth < limit:
yield from tree(vertex_with_depth)

return new_tree, start

def iddfs(tree: Tree[T], start: T) -> Iterator[T]:
for limit in it.count():
limited_tree = depth_limited(tree, start, limit)
level_size = 0

for vertex, depth in dfs(*limited_tree):
if depth == limit:
yield vertex
level_size += 1

if not level_size:
break


To adapt the termination check to the original recursive bfs function, we have to modify the logic a bit so that we only get one level at a time:

def bfs(tree: Tree[T], start: T) -> Iterator[T]:
yield start
iterator = bfs(tree, start)
level_size = 1

while level_size:
level = it.islice(iterator, level_size)
level_size = 0

for vertex in level:
for child in tree(vertex):
yield child
level_size += 1


The function is now behaviorally more of an exact counterpart of dfs, although it is no longer so similar in implementation.

I still find it to be a bit of a startling coincidence that just switching the order of those two calls within the recursive implementation of dfs makes it into a breadth-first search (even if it behaves somewhat differently with regard to termination). It seems like something there should be a deeper explanation for, but I don’t know what it would be. And does it have anything to do with the stack/queue duality? The way IDDFS works doesn’t seem to involve a queue, not even implicitly, so maybe not. I don’t have answers to these questions, unfortunately.

## A simple motivation for the notion of a natural transformation

A natural transformation is an operation on a category, or more precisely a family of operations, one for each object in the category, which is preserved by morphisms in the category.

Each operation in the family is associated with a specific object $A$ in the category, which it is said to be on. The operation itself is an endomorphism on $A$, modulo application of functors in a regular way to the domain and codomain. That is, it is a morphism from $F(A)$ to $G(A)$ for some functors $F$ and $G$ which are the same for every object. We call $F$ and $G$ the domain and codomain of the operation, respectively.

For example, multiplication in a group $G$ is a function from $U(G)^2$ to $U$, where $U$ is the forgetful functor from groups to sets. So the domain and codomain of the natural transformation corresponding to multiplication in groups are $FU$ and $U$ respectively, where $F$ is the endofunctor on the category of sets such that for every set $A$, we have $F(A) = A^2$, and for every function $f : A \to B$ and any two elements $x$ and $y$ of $A^2$, we have $F(f)(x, y) = (f(x), f(y))$.

If, for every object $A$, we write the operation on $A$ as $\alpha_A$, then the naturality condition requires that for every morphism $f : A \to B$, we have $G(f) \circ \alpha_A = \alpha_B \circ F(f)$. This is just the requirement that $f$ “preserves” the operation: applying $f$ to the output is the same as using the value under $f$ as the input (modulo the functors we need to get the types right).

Consider the example of groups: a group homomorphism $f : G \to H$ needs to have the property that for any two elements $x$ and $y$ of $U(G)$, we have $U(f)(xy) = U(f)(x) U(f)(y)$. (Of course we don’t usually write the forgetful functor explicitly, but we’re trying to be precise here.) If we write multiplication in $G$ and $H$ using prefix symbols $\alpha_G$ and $\alpha_H$ respectively, then this equation becomes $U(f)(\alpha_G(x, y)) = \alpha_H(U(f)(x), U(f)(y))$. Looking at the way $F$ acts on morphisms, we see that this can be further rewritten as

$\displaystyle U(f)(\alpha_G(x, y)) = \alpha_H(F(U(f))(x, y)),$

i.e.

$\displaystyle U(f) \circ \alpha_G = \alpha_H \circ F(U(f))$.

I like this way of motivating natural transformations because—well, first of all because it’s a way of motivating natural transformations at all, which is more than pretty much every category theory introduction I’ve seen does, at least if you don’t count saying that there ought to be some notion of morphism between functors, and then declaring by fiat that natural transformations are the appropriate notion of morphism between functors, as motivation. But also because it allows us to reconcile the formal category-theoretic understanding of a category as having its morphisms as part of its data with the perhaps more natural pre-formal understanding of a category as having operations as part of its data, and morphisms being characterized by their preservation of those operations.

In general, we can say that an operation on a category $X$ is a family of morphisms $\alpha_A : F(A) \to G(A)$ on objects $A$ of $X$, where $F$ and $G$ are functors from $X$ to some other category $Y$. In other words, it is like a natural transformation but it doesn’t need to satisfy the naturality condition. Now, if we have an operation like this we can consider the set of the morphisms $f : A \to B$ in $X$ which do satisfy the natural condition, i.e. have $G(f) \alpha_A = \alpha_B F(f)$. It turns out that this set of morphisms is always closed under composition and contains every identity morphism, and therefore induces a subcategory of $X$:

Proof that it’s closed under composition. Suppose $f : A \to B$ and $g : B \to C$ are morphisms in $X$ and $G(f) \alpha_A = \alpha_B F(f)$ and $G(g) \alpha_B = \alpha_C F(g)$. Then

$\displaystyle G(gf) \alpha_A = G(g) G(f) \alpha_A = G(g) \alpha_B F(f) = \alpha_C F(g) F(f) = \alpha_C F(gf).$

Proof that it contains every identity morphism. Suppose $A$ is an object in $X$. Then

$\displaystyle G(1_A) \alpha_A = 1_{G(A)} \alpha_A = \alpha_A = \alpha_A 1_{F(A)} = \alpha_A F(1_A).$

For example, if we consider the category of groups and functions between groups, the category of groups and group homomorphisms is induced in this way as a subcategory by the operation of multiplication in groups.

It’s interesting to consider the converse: given a category $X$ and a subcategory $Y$ of $X$, under what circumstances is there is a category $Z$, functors $F$ and $G$ from $X$ to $Z$, and an operation $\alpha : F \to G$ such that the morphisms in $X$ are precisely the morphisms $f : A \to B$ in $X$ with $G(f) \circ \alpha_A = \alpha_B \circ F(f)$? Obviously $Z$ will have to still contain every object in $X$ (since the construction described here only limits the morphisms in the subcategory, not the objects), but I will leave any more detailed consideration of this question for readers.

## Notes on periodic functions

A period of a function f : ℝ → X is a T ∈ ℝ such that for every t ∈ ℝ, we have f(t + T) = f(t). Note that X may be any set; the definition does not refer to any structure on X other than its equality relation.

For every function f : ℝ → X, the set of periods of f is an additive subgroup of :

• For any two periods T1 and T2 of f and every t ∈ ℝ, we have
f(t + T1 + T2) = f(t + T1) = f(t),
so T1 + T2 is a period of f.
• For every period T of f and every t ∈ ℝ, we have
f(t − T) = f((t − T) + T) = f(t),
so  − T is a period of f.

Therefore, we call it the group of periods of f.

The converse also holds: every additive subgroup G of is the group of periods of a function, namely the canonical projection f : ℝ → ℝ/G (which takes every t ∈ ℝ to its equivalence class wrt the equivalence relation {(t1, t2) : t2 − t1 ∈ G}):

• For every T ∈ G and every t ∈ ℝ, we have (t + T) − t = T ∈ G and hence f(t + T) = f(t), so T is a period of f.
• Conversely, for every period T of f, we have f(T) = f(0) and hence T − 0 = T ∈ G.

Henceforth, we shall always refer to additive subgroups of as groups of periods.

It is worth noting explicitly that, in addition to the two closure properties proven above, the fact that the groups of periods are groups implies that they are closed under integer multiples, and in particular contain 0.

A function is aperiodic iff its group of periods is trivial, i.e. 0 is its only period. Otherwise, it is periodic.

If a group G of periods is nontrivial and cyclic, then it has a nonzero generator a. Since the set of the generators of G is closed under negation, we can choose a to be positive. This a is then the least positive element of G, because for every positive x ∈ G, there is an n ∈ ℤ such that x = na where n ∈ ℤ, and this n has to be positive (since x and a are positive), so that x ≥ a. The logic here works for any ordered group, not just groups of periods. It also works for any positive generator of G—so the least positive element of G is the only positive generator of G.

Conversely, if a group G of periods has a least positive element a, then a generates G. To see this, suppose x ∈ G and let n be the floor of x/a, so that n ≤ x/a < n + 1, i.e. na ≤ x < (n + 1)a, i.e. 0 ≤ x − na < a. Then x − na is an element of G less than a, so it cannot be positive; it must be 0, so that x = na. (The logic here doesn’t work for any ordered group, because it relies on the Archimedean property of the real numbers.)

So a nontrivial group G of periods has a least positive element a iff it is cyclic, in which case a is the unique positive generator of G. In this situation, we call a the fundamental period of G, or more loosely just the period of G.

The “canonical” examples of periodic functions have fundamental periods. For example, the sine and cosine functions have a fundamental period of 2π. However, constant functions are also technically periodic, and they do not have fundamental periods: a function f : ℝ → X is constant iff its group of periods is . There are more obscure examples. Consider the indicator function 1 of . For every T ∈ ℚ and every t ∈ ℝ, we have t + T ∈ ℚ iff t ∈ ℚ and hence f(t + T) = f(t); so T is a period of f. On the other hand, for every irrational T ∈ ℝ, we have ( − T) + T = 0 ∈ ℚ but  − T ∉ ℚ, and hence f(( − T) + T) = 1 but f( − T) = 0; so there is a t ∈ ℝ such that f(t + T) ≠ f(t), and hence T is not a period of f. It follows that the group of periods of 1 is .

There is a topological criterion which can be used to determine whether a periodic function has a fundamental period: a group G of periods is cyclic iff it is not dense in . To see this, observe first that if G is dense in , then it has elements arbitrarily close to 0, symmetrically on both the positive and negative sides of the number line (since G is closed under negation). Therefore G does not have a least positive element and hence is acyclic.

Conversely, suppose G is acyclic and hence has no least positive element. Then for every positive x ∈ G, there is a positive y ∈ G less than x. The difference x − y is also a positive element of G less than x, and at least one of y and x − y is less than or equal to x/2 (otherwise their sum would be greater than x). So in fact there is a positive y ∈ G less than or equal to x/2. By repeatedly applying this we can see that for every n ∈ ℕ, there is a positive y ∈ G less than or equal to x/2n. So G has positive elements arbitrarily close to 0. It follows that for every a ∈ ℝ and every positive ε ∈ ℝ, there is a positive x ∈ G less than ε. The interval (a − ε, a + ε), being of length 2ε > x, must contain nx for some n ∈ ℤ. And since G is closed under integer multiples, we have nx ∈ G. So G has elements arbitrarily close to a. This holds for every a ∈ ℝ, so G is dense in .

This criterion is simplified for continuous periodic functions f : ℝ → X, where X is an accessible topological space (a topological space X is accessible iff any two distinct points in X have neighbourhoods not containing the other point). This is because the group G of periods of such a function f is closed. To see this, suppose T0 is a limit point of G; it will suffice to prove that T0 ∈ G. To see that, suppose t ∈ ℝ; it will suffice to prove that f(t + T0) = f(t). To see that, suppose N is a neighbourhood of f(t + T0); it will suffice to prove that f(t) ∈ N. Since f is continuous, so is its right-composition with adding t. Therefore there is a neighbourhood M of T0 such that f(t + M) ⊆ N. And since T0 is a limit point of G, there is a T ∈ M ∩ G. Since T ∈ M, we have f(t + T) ∈ N. Since T ∈ G, we have f(t + T) = f(t) and hence f(t) ∈ N.

Since G is closed, it is equal to its closure. So it can only be dense in (i.e. have its closure equal to ) iff it is equal to . Therefore, given that f is a continuous periodic function with an accessible codomain, it has a fundamental period iff it is nonconstant.

## A proof of the boundedness theorem by induction

As usual, this post can be viewed as a PDF.

Theorem (Boundedness Theorem). A continuous real-valued function on a closed interval is bounded.

Proof. Suppose f is such a function and [a, b] is its domain. First, observe that for every c ∈ [a, b], since f is continuous at c, there is a positive $\delta \in {\mathbb R}$ such that for every x ∈ [a, b]∩(c − δ, c + δ), we have f(x)∈(f(c)−1, f(c)+1). So there is a neighbourhood of c, namely (c − δ, c +  δ), on which f is bounded. Thus f is, in a sense, locally bounded at every point in its domain; the problem is to prove that this local boundedness implies global boundedness.

In textbook proofs of the boundedness theorem, this is generally done using what I would regard as a trick, such as supposing f isn’t bounded and using the Bolzano-Weierstrass theorem to obtain a contradiction. More advanced texts may appeal to the compactness of [a, b], but the proof that [a, b] is compact (the Heine-Borel theorem) amounts to basically the same logic, and is usually no less trickful1.

However, if we think about how to construct an algorithm to find a global bound (often a helpful move to make—the un-situated-in-time-ness of ordinary mathematical language can be quite thought-limiting) then there is a procedure which, in my opinion, is quite obvious and immediately suggests itself. Just start on the left, with the singleton interval {a}, on which f is certainly bounded, and repeatedly apply local boundedness to the right endpoint, gradually expanding the subinterval of [a, b] on which we know f to be bounded. At each step the existing subinterval has a bound, and the neighbourhood of the right endpoint has a bound; taking the maximum of these two bounds gives us a bound of the whole expanded subinterval formed by unioning2 the neighbourhood with the existing subinterval. Eventually the right endpoint will reach b and we will no longer be able to expand it further, at which point we stop and observe that we have bounded the whole of [a, b]. (Sanity check: why would this fail for (a, b)? Because then we wouldn’t be able to take the right endpoint to b; there would be nowhere to stop the procedure, without leaving a part of the right of (a, b) unbounded.)

Of course, this algorithm will not in general terminate in a finite number of steps, but this is no issue; don’t let your thinking be limited by the arbitrary limitations of physical machines. Obviously, it is still necessary to prove that this algorithm will terminate, even if it takes infinitely many steps. Since we’re now dealing with the distinctions among the infinite quantities, we’re going to need to use some set-theoretic machinery (which is why you won’t see this proof in introductory real analysis textbooks). I’ll assume that you are familiar with the ordinals, the techniques of transfinite induction and recursion, and Burali-Forti’s paradox (which says that there is no set of all ordinals).

Here’s the plan. Using transfinite recursion, we shall construct an ordinal-indexed sequence xα of members of [a, b] such that every ordinal α has the following properties:

1. The function f is bounded on [a, xα].

2. We have xα ≤ xα + 1, and if xα = xα + 1, then xα + 1 = b.

Then, since there are a proper class’s worth of ordinals and [a, b] is just a set, the sequence xα will have to repeat3, so there will be ordinals α and β such that α < β and xα = xβ. By (2), the sequence xα will be nondecreasing, so we will have xα ≤ xα + 1 ≤ xβ and hence xα = xα + 1 = xβ = b. Then by (1), it will follow that f is bounded on [a, b].

All we need to do to complete the proof is describe the recursion and make it clear that it is a valid recursion (i.e. all of the assumed properties used to construct the next term of the sequence are preserved by the constructed sequence with the next term added).

Base case
Let x0 = a and observe that f is bounded on {a} by |f(a)|.
Successor case
For every ordinal α, if we assume that xα ∈ [a, b], then since f is continuous, there is a positive $\delta \in {\mathbb R}$ such that f is bounded on [xα, xα + δ] by some M. Let xα + 1 = min(b, xα + δ), and observe that:

• If b ≤ xα + δ, then xα + 1 = b and hence a ≤ xα ≤ xα + 1 ≤ b. Otherwise, we have xα + 1 = xα + δ and hence a ≤ xα < xα + 1 ≤ b, with the middle inequality strict, so that we can only have xα = xα + 1 when xα + 1 = b.
• We have xα + 1 ≤ xα + δ, so f is bounded by M on [xα, xα + 1] as well as [xα, xα + δ]. Assuming f is bounded on [a, xα] by some L, it follows that f is bounded on [a, xα + 1] by max(L, M).

Limit case

For every limit ordinal λ, let A = {xα : α < λ}. Then A contains x0 and hence is nonempty, and is bounded above by b, so it has a supremum. Let xλ = supA, and observe that:

• If we assume that A ⊆ [a, b], then since A is nonempty and xλ is ge every member of A, we have xλ ≥ a; and since b bounds A above and xλ is the smallest upper bound of A, we also have xλ ≤ b.
• Since xλ ∈ [a, b] and f is continuous, there is a positive $\delta \in {\mathbb R}$ such that f is bounded on [xλ − δ, xλ] by some M. And since xλ = supA, there is an ordinal α < λ such that xλ − δ < xα and hence f is bounded on [xα, xλ] by M. Assuming f is bounded on [a, xα] by some L, it follows that f is bounded on [a, xλ] by max(L, M).

And that’s it.

Of course, this proof is a bit tedious. There might be ways to make it shorter (maybe we can use Zorn’s lemma instead of transfinite induction). But the advantage it has over other proofs of the boundedness theorem I’ve seen is that it falls out more or less automatically, without requiring any flashes of insight.

1. This word doesn’t appear to exist currently, but it needs inventing.

2. Another word that needs inventing. OK, I could use “uniting”, but I suspect people wouldn’t immediately make the connection with the union operation on sets if I used that word.

3. This is a sort of infinitary pigeonhole principle.

## Notes on proving the completeness theorem for propositional logic

NB: I’ve opted to just get straight to the point with this post rather than attempting to introduce the subject first, so it may be of little interest to readers who aren’t already interested in proving the completeness theorem for propositional logic.

The key thing I had to realize for the proof of the completeness theorem for propositional logic to “make sense” to me was that interpretations of a language of propositional logic can be thought of as theories.

Normally, an interpretation of a language L of propositional logic is thought of as a function υ from the set of the sentence variables in L to {0, 1}, which is extended to the set of the formulas in L by letting υ(⊥)=0 and recursively letting
$\upsilon(\phi \rightarrow \psi) = \upsilon(\psi)^{\upsilon(\phi)} = \begin{cases} 0 &\quad \text{if } \upsilon(\phi) = 0 \text{ or } \upsilon(\psi) = 1, \\ 1 &\quad \text{if } \upsilon(\phi) = 1 \text{ and } \upsilon(\psi) = 0 \end{cases}$
for every pair of sentences ϕ and ψ in L (assuming and are the primitive connectives). The value under υ of a sentence ϕ is thought of as the truth value of ϕ under υ, with 0 standing for falsity and 1 standing for truth.

A theory in L, on the other hand, is thought of as a set of sentences in L, these sentences being the nonlogical axioms of the theory. It’s a completely different type of object from an interpretation.

But if you think about what these formal concepts are trying to get at, they’re quite similar. Both of them are essentially requirements that certain sentences be true. A theory requires its nonlogical axioms to be true. An interpretation requires the sentences true under it to be true, and the sentences false under it to be false, which might appear to be a slightly more elaborate concept, but since a sentence is false iff its negation is true, you know what sentences an interpretation makes false if you know what sentences it makes true. The only substantial difference is that an interpretation is subject to certain restrictions compared to a general theory:

1. An interpretation of L must require every sentence variable in L to be either true or false (and not both).

2. An interpretation of L must require to be false.

3. For every pair of sentences ϕ and ψ in L, an interpretation of L must require ϕ → ψ to be true iff it requires ϕ to be false or requires ψ to be true (or both).

Formally, we can define a function f on the set of the interpretations of L by the rule that for every interpretation υ of L, we have
$f(\upsilon) = \{\phi : \upsilon(\phi) = 1\} \cup \{\neg \phi : \upsilon(\phi) = 0\}.$
This function f is an injection into the set of the theories in L. But it is not surjective—for every theory M in L, the value f−1(M) exists only if the statements below hold:

1. For every sentence variable A in L, exactly one of A and ¬A is a member of M.
2. ⊥ ∉ M.

3. For every pair of sentences ϕ and ψ in L, we have ϕ → ψ ∈ M iff ϕ ∉ M or ψ ∈ M.

An interpretation could reasonably be defined as a theory M in L with properties (1)-(3) above.

Now, the completeness theorem says that every syntactically consistent theory in L is semantically consistent, i.e. has a model. A model of a theory T in L is an interpretation of L under which every member of T is true, so if we adopt the definition of interpretations as theories, we can say that a model of T is an extension of T which is an interpretation of L. This leads us to the idea that in order to construct a model of an arbitrary syntactically consistent theory T in L, we will have to extend T by adding new members.

But how exactly do we extend T? Here, it helps to recall the soundness theorem, which is the converse of completeness: it says that every semantically consistent theory in L is syntactically consistent. (This is generally straightforward to prove from the definition of syntactic consequence. I’m not going to prove it here because I’m trying to be agnostic about exactly how syntactic consequences are defined, and the proof is different depending on how syntactic consequences are defined.) Every interpretation of L, thought of as a theory in L, is certainly semantically consistent (since it has itself as an extension) and hence, by soundness, syntactically consistent. Therefore one thing we definitely need to do as we add new members to T, in order for it to eventually become an interpretation of L, is preserve the syntactic consistency of T.

In fact, it’s not too difficult to see that every theory M in L which can be thought of as an interpretation of L is not only syntactically consistent, but maximally syntactically consistent: every proper extension of M is syntactically inconsistent.

To see this, first, observe that for every sentence ϕ in L, since ¬ϕ abbreviates ϕ → ⊥, property (3) tells us that ¬ϕ ∈ M iff ϕ ∉ M or ⊥ ∈ M; and M never contains . So we have ¬ϕ ∈ M iff ϕ ∉ M. In other words, M contains exactly one of ϕ and ¬ϕ.

Now, suppose M is a proper extension of M. Then it has a member ϕ which is not a member of M. The negation of ϕ must then be a member of M. Since M extends M, it follows that ¬ϕ ∈ M. But we also have ϕ ∈ M. Since every member of M is a syntactic consequence of M, it follows that M is syntactically inconsistent.

Now that we know every interpretation of L is maximally syntactically consistent, the natural next question to ask is whether the converse holds, i.e. every maximally syntactically consistent theory T in L is an interpretation of L. If the converse does hold, then to prove the completeness theorem, all we need to do is extend a syntactically consistent theory to a maximal syntactically consistent theory. As it happens, the converse does hold.

Lemma 1. For every maximally syntactically consistent theory M in L and every sentence ϕ in L, exactly one of ϕ and ¬ϕ is a member of M.

Proof. Suppose M is a maximally syntactically consistent theory in L and ϕ is a sentence in L.

If M contains both ϕ and ¬ϕ, then M is syntactically inconsistent. So M contains at most one of ϕ and ¬ϕ.

If M contains neither ϕ nor ¬ϕ, then M ∪ {ϕ} and M ∪ {¬ϕ} are proper extensions of M and hence are syntactically inconsistent, from which it follows by negation introduction and elimination that M syntactically implies both ϕ and ¬ϕ, and hence is syntactically inconsistent. So M contains at least one of ϕ and ¬ϕ. ■

Theorem 1. Every maximally syntactically consistent theory M in L is an interpretation of L.

Proof. Suppose M is a maximally syntactically consistent theory in L.

For every sentence variable A in L, exactly one of A and ¬A is a member of M by the lemma above.

If ⊥ ∈ M, then M syntactically implies and hence is syntactically inconsistent; so we have ⊥ ∉ M.

Suppose ϕ and ψ are sentences in L. We shall prove that ϕ → ψ ∈ M iff ϕ ∉ M or ψ ∈ M.

For the forward implication, suppose ϕ → ψ ∈ M. If ϕ ∈ M and ψ ∉ M, i.e. ¬ψ ∈ M, then M syntactically implies both ϕ → ψ and ¬ψ, so by modus tollens it follows that M ⊢ ¬ϕ. But we also have M ⊢ ϕ, so M is syntactically inconsistent. This is a contradiction, so one of ϕ ∈ M and ψ ∉ M must not hold.

For the backward implication:

1. First, suppose ϕ ∉ M, i.e. ¬ϕ ∈ M. Then M ⊢ ¬ϕ, so M ⊢ ϕ → ψ.
2. Second, suppose ψ ∈ M. Then M ⊢ ψ and hence M ⊢ ϕ → ψ.

Either way, we have M ⊢ ϕ → ψ. Therefore, if ϕ → ψ ∉ M, i.e. ¬(ϕ → ψ ∈ M), so that M ⊢ ¬(ϕ → ψ), we have that M is syntactically inconsistent, which is a contradiction. So ϕ → ψ ∈ M. ■

Now, to complete the proof, we just need to prove that an arbitrary syntactically consistent theory T in L can be extended until it is maximally syntactically consistent. The standard tool for carrying out such proofs is Zorn’s lemma. (There are other techniques that can be used, like transfinite induction; if you’re not too familiar with proofs like this, using transfinite induction generally makes things clearer. But Zorn’s lemma makes the proof more concise, so that’s what I’ll use here.)

Note that we make use of the “syntactic compactness theorem” here, which says that a theory is syntactically consistent iff each of its finite subsets is syntactically consistent. Unlike the semantic compactness theorem, which is most straightforwardly proven as a consequence of completeness, the syntactic compactness theorem is trivial; it essentially follows from the fact that proofs are finite and hence any proof of only makes use of finitely many axioms.

Theorem 2. Every syntactically consistent theory has a maximal syntactically consistent extension.

Proof. Suppose T is a syntactically consistent theory. To prove that T has a maximal syntactically consistent extension, we shall use Zorn’s lemma; so suppose 𝒯 is a chain of syntactically consistent extensions of T and let T be the union of the theories in 𝒯. To prove that T is consistent, we shall use the syntactic compactness theorem; so suppose U = {ϕ1, ϕ2, …, ϕn} is a finite subset of T. In the case where U is empty, it is certainly syntactically consistent. Otherwise, let T1, T2, …and Tn be theories in 𝒯 containing ϕ1, ϕ2, …and ϕn respectively. Then {T1, T2, …, Tn} is a finite, nonempty chain, so it has a maximum U, which is syntactically consistent and extends U. Therefore U, having a syntactically consistent extension, must be syntactically consistent itself. ■

## Programming languages as theorem verifiers

(This post is also available as a PDF).

One of the most interesting ideas in modern logic is the Curry-Howard correspondence. This is the informal observation that, from a certain perspective, mathematical proofs and computer programs are the same thing.

To be more exact: there is a correspondence between mathematical formulas and data types in computer programs. In particular, the correspondence is not with data values. This is something that puzzled me a little when learning about this—why couldn’t it be with data values? But I think I can now express why it has to be with types: a mathematical formula expresses that something is the case, and a data type, when attached to a data value, can likewise be thought of as expressing that something is the case for that data value. A data value, on the other hand, is just a thing; it makes no sense to say that it expresses something. (Perhaps one could say that data values correspond to the truth values of formulas, as opposed to the formulas themselves.)

In addition, there is a correspondence between rules of inference and operations which combine programs. Just as mathematical proofs can be thought of as being built up from smaller proofs whose combination is justified by rules of inference, so computer programs can be thought of as being built up from smaller programs using various operations (different words may be used depending on the programming language—“subroutine”, “function”, etc.—but the most general way to describe how programs are built up is to say that they are built up from smaller programs).

As an illustration, let’s have a look at how the Hilbert system which I talked about in the last post can be described in terms of programs, types and operations rather than proofs, formulas and rules of inference. A Hilbert system described in such a way becomes a system of combinatory logic.

The defining characteristic of Hilbert system is that they have few rules of inference and rely on axioms for their expressive power. Accordingly, combinatory logic systems have few ways of combining programs and rely on atomic programs known as combinators for their expressive power. (Note that the programs in combinatory logic are simple ones, which just transform inputs into outputs without altering any state. As a result, they can just as easily be thought of as mathematical functions—thinking of combinatory logic as a programming language is just one way of thinking of it, albeit one which comes quite naturally.)

Although combinatory logic can be presented formally, it exists embedded in any programming language with a reasonably powerful type system, such as Haskell. Doing combinatory logic in a practical programming language helps it sink in that there’s a real correspondence there. So, here is a complete proof of the reflexivity of the conditional connective in a restricted fragment of Haskell:

s x y z = x z (y z)
k x y = x

i :: a -> a
i = s k k

main :: IO ()
main = return ()

The first two lines in this program define the k and s combinators, which correspond to the two axiom schemes in a minimal Hilbert system. These are implemented in Haskell as polymorphic functions, with instantiations of the functions with specific types corresponding to the individual axioms permitted by the schemes. Note the type signatures (I didn’t write them out explicitly in the program, since Haskell can infer them automatically):

k :: a -> b -> a
s :: (a -> b -> c) -> (a -> b) -> a -> c

The signatures are just the same as the axioms of the minimal Hilbert system, except that the function type-forming operator -> is used in place of the conditional connective, and the operands are thought of as type variables, not formula variables.

The third line, the type signature of i, is where we state the theorem that we’re trying to prove—in this particular case, it’s p → p for an arbitrary formula p. Replacing the formula variable p with a type variable a and the conditional connective with the -> operator, we get the type signature a -> a. So in combinatory logic, the theorem amounts to the assertion that there is a function that returns values of the same type as its argument.

The proof itself is the definition of the function. This definition, by itself, determines the type of the function; the type signature declared in the third line only comes into play because the Haskell compiler will check that it is compatible with the actual type determined, and if it isn’t, it will refuse to compile the program. Thus, the theorem is proven if and only if the program successfully compiles.

In the proof of p → p in a Hilbert system, the first step is to assert that
(p → (p → p)→p)→(p → p → p)→p → p
is an axiom, because it is an instance of the axiom scheme (p → q → r)→(p → q)→p → r. In Haskell, we could make this assertion explicit by defining a new function with the type signature (a -> (a -> a) -> a) -> (a -> a -> a) -> a -> a, but making the definition of this function just delegate to s:

s1 :: (a -> (a -> a) -> a) -> (a -> a -> a) -> a -> a
s1 = s

This would make the compiler check that the polymorphic function s can indeed be instantiated with this type. It isn’t absolutely necessary, though; we can just use s itself in place of s1, without requiring the compiler to verify the intermediate step. In fact, in the program I wrote above I have not asked the compiler to verify any of the intermediate steps, only the final type signature of a -> a, because the program is nice and concise that way.

The next step in the proof of p → p, in its most detailed version, would be to assert that p → (p → p)→p is a valid instance of the axiom scheme p → q → p. This can be done in Haskell in much the same way, by defining a function k1 which is the same as k but has a more specific type signature.

The next step after that is to apply the modus ponens rule of inference to these two axioms, proving the formula (p → p → p)→p → p. What is the counterpart of modus ponens in combinatory logic? It’s simply function application, because if x is a function of type a -> b and y is a value of type a, then type of the value x y returned by x, given y as its argument, is b. So the Haskell way of expressing this step is as follows:

i0 :: (a -> a -> a) -> a -> a
i0 = s1 k1

(Note that s1 and k1 can be replaced with s and k here, if they were not defined earlier.)

The final step is to apply modus ponens to the formula p → p → p)→p → p (the one we just proved) and the axiom p → p → p. This amounts to applying the i0 function to k (of course we could also define a k2 function with the specific type signature a -> a -> a, and use it in place of k, as with the other two axioms used). If we compress the whole proof into one function, we get the i function in the program above:

i :: a -> a
i = s k k

And if you’re wondering about the last two lines in the original code snippet, they’re just required in order to make the program a valid Haskell program that will compile.

There’s a lot more to say about this subject, such as the role of intuitionistic vs. classical logic, and the relationship between quantifiers and dependent types …however, I still don’t understand these deeper aspects of the subject very well, so I won’t say any more in this post.

## Motivating Hilbert systems

(This post is also available as a PDF.)

Hilbert systems are formal systems that encode the notion of a mathematical proof, which are used in the branch of mathematics known as proof theory. There are other formal systems that proof theorists use, with their own advantages and disadvantages. The advantage of Hilbert systems is that they are simpler than the alternatives in terms of the number of primitive notions they involve.

Formal systems for proof theory work by deriving theorems from axioms using rules of inference. The distinctive characteristic of Hilbert systems is that they have very few primitive rules of inference—in fact a Hilbert system with just one primitive rule of inference, modus ponens, suffices to formalize proofs using first-order logic, and first-order logic is sufficient for all mathematical reasoning. Modus ponens is the rule of inference that says that if we have theorems of the forms p → q and p, where p and q are formulas, then χ is also a theorem. This makes sense given the interpretation of p → q as meaning “if p is true, then so is q”.

The simplicity of inference in Hilbert systems is compensated for by a somewhat more complicated set of axioms. For minimal propositional logic, the two axiom schemes below suffice:

1. For every pair of formulas p and q, the formula p → (q → p) is an axiom.
2. For every triple of formulas p, q and r, the formula (p → (q → r)) → ((p → q)→(p → r)) is an axiom.

One thing that had always bothered me when reading about Hilbert systems was that I couldn’t see how people could come up with these axioms other than by a stroke of luck or genius. They are rather complicated, and even more so, the proofs that one generates using them directly are rather complicated. To illustrate, here’s an example of a proof in a Hilbert system using these two axioms:

Theorem 1. For every formula p, the formula p → p is a theorem.

Proof.

1. From axiom scheme 1 we have that p → (p → p) is an axiom.
2. From axiom scheme 1 we have that p → ((p → p)→p) is an axiom.
3. From axiom scheme 2 we have that (p → ((p → p)→p)) → ((p → (p → p)) → (p → p)) is an axiom.
4. Applying modus ponens to the theorems proved in steps 2 and 3, we see that (p → (p → p)) → (p → p) is a theorem.
5. Applying modus ponens to the theorems proved in steps 1 and 4, we see that p → p is a theorem.

This proof is rather difficult to come up with, because the reasoning in it is totally unlike how people naturally do mathematical reasoning. A more natural informal proof that p → p is a theorem would probably go something like this:

1. Assume that p is true.
2. Then (repeating ourselves) we have that p is true.
3. It follows that p → p is a theorem by applying the rule that of inference saying that if we assume some formula q is true and then derive some formula r, then q → r is a theorem.

The rule of inference mentioned in step 3 is called deduction. The rules of inference of modus ponens and deduction together encapsulate the interpretation of the conditional connective as meaning “if …then”. Whereas modus ponens tells us what we can prove from an “if …then” statement, deduction tells us how we can get to an “if …then” statement.

Although deduction isn’t a primitive rule of inference in Hilbert systems, it can be proven to be a valid derived rule of inference. This requires a formalization of the notion of assumption; fortunately this doesn’t require any extra machinery. The assumption of a formula p can be thought of as a movement from the original Hilbert system into another Hilbert system that has p as an extra axiom. Let us introduce some convenient notation: if we identify Hilbert systems with their sets of axioms, then given a Hilbert system Γ and a formula p, we can write the Hilbert system obtained by adding p as an axiom to Γ as Γ ∪ {p}, using set notation. Then we can formally state the deduction theorem like so:

Theorem 2. For every Hilbert system Γ and every pair of formulas p and q such that q is a theorem of Γ ∪ {p}, the formula p → q is a theorem of Γ.

The proof of the deduction theorem is quite straightforward if we use an inductive technique. Since modus ponens is the only rule of inference in Hilbert systems, the set of the theorems of a Hilbert system Γ can be defined as the smallest set X with the two properties listed below:

1. Every axiom of Γ is a member of X.
2. For every pair of members of X of the forms p → q and p, where p and q are formulas, the formula q is also a member of X.

Therefore, in order to prove that a set X contains every theorem of Γ, it suffices to prove that X has these two properties.

For the deduction theorem, if we are given some fixed formula p, we may consider the set X of the theorems q of Γ ∪ {p} such that p → q is also a theorem of Γ. Then, to prove the first property, we have to prove that X contains every axiom of Γ ∪ {p}. By Theorem 1 we have that p → p is a theorem of Γ, so X certainly contains p. As for the axioms of Γ, by axiom scheme 1 we have that for every formula q, including the axioms of Γ, the formula q → (p → q) is a theorem of Γ. Applying modus ponens, it follows that p → q is a theorem of Γ for every axiom q of Γ.

The second property is even more straightforward to prove. Suppose q and r are formulas and q → r and q are members of X, so that p → (q → r) and p → q are theorems of Γ. By axiom scheme 2, we have that (p → (q → r)) → ((p → q)→(p → r)) is a theorem of Γ. Applying modus ponens to this theorem and p → (q → r), it follows that (p → q)→(q → r) is a theorem of Γ. Applying modus ponens again to this theorem and p → q, it follows that p → r is a theorem of Γ and hence r ∈ X. This completes the proof of the deduction theorem.

Now, I realized just the other day that this proof can be used to explain where axiom schemes 1 and 2 come from. Suppose we were trying to prove the deduction theorem using modus ponens only without knowing what axiom schemes to start with. The proof would proceed as usual except for the final steps in proving each of the two properties. For the first property, we would need to be able to prove that p → p is a theorem of Γ in one case, and in the other case we would get to a state where we would have that a formula q is a theorem of Γ, and we would need to be able to conclude that p → q is also a theorem of Γ. For the second property, we would get to a state where we would have that formulas of the forms p → (q → r) and p → q, where q and r are formulas, are theorems of Γ, and we would need to be able to conclude that p → r is also a theorem of Γ. This would then naturally motivate us to introduce the three axiom schemes listed below, which are exactly the theorems we need to have available in order to draw the required conclusions straightforwardly by repeatedly applying modus ponens.

1. For every formula p, we have p → p.
2. For every pair of formulas p and q, we have p → (q → p).
3. For every triple of formulas p, q and r, we have (p → (q → r)) → ((p → q)→(q → r)).

At some point we would have to realize that the first axiom scheme was unnecessary, of course, and this would probably have to happen as a surprising discovery after playing around with the system.

(One would also naturally wonder if it would be possible to prove the second axiom from the first and third, or the third axiom from the first and second, instead of the first axiom from the second and third; or even the second axiom from the third alone, or the third axiom from the second alone. Presumably it is impossible in all cases, but I don’t know how prove this.)

Nevertheless, I feel a lot more comfortable with Hilbert systems now that I can see how one might be directed towards their definition.

## Tricks in computer arithmetic

(This post is also available as a PDF, with better typesetting for the mathematical formulas and syntax highlighting for the code listings.)

People who do a lot of mental arithmetic often make use of “tricks” to carry out calculations. An example familiar to most people is that you can multiply an integer represented by its decimal expansion by 10 by simply adding an extra 0 digit: for example, 321 times 10 is 3210. Another trick, which not so many people are familiar with, is that in order to determine whether an integer is divisible by 3 it suffices to examine the sum of the digits in its decimal expansion: the original integer is divisible by 3 if and only if this sum is. For example, 321 must be divisible by 3 because 3 + 2 + 1 = 6, and 6 is divisible by 3. The defining characteristic of tricks like these is that they enable people to do less calculations to reach their result, reducing the time taken, the cognitive effort required and the likelihood of error, while at the same time only being applicable in a limited range of circumstances, so that they are unable to fully replace the more cumbersome but more general algorithms that are taught to us in school.

It might come as a surprise to learn that computers also make use of such tricks. Computers can achieve much greater speeds and have a greater working memory capacity than humans, and it hardly matters how much effort they have to go to calculate things; and they hardly ever make errors, provided they are making use of a correct algorithm.1 So one might think they wouldn’t need to resort to tricks. But computers often need to perform lots of arithmetic calculations in a short amount of time, and in such circumstances, any slight speed up in one of the individual operations can have an arbitrarily large effect on the speed of the whole procedure, depending on how many times the operation needs to be repeated. So it’s primarily the fact that tricks increase the speed of calculation that makes them worthwhile for computers.

A big difference between computers and humans, when it comes to arithmetic, is that computers represent integers by their binary expansions, rather than their decimal expansions. But a lot of the tricks humans use can still be transferred fairly straightforwardly to the binary environment. For example, whereas adding a zero digit multiplies the value of a decimal expansion by 10, adding a zero digit multiplies the value of a binary expansion by 2. So computers are best at multiplying by 2, rather than by 10. This is actually a lot more useful—it’s more often necessary to double a quantity than to multiply it by 10.

What about the trick for checking whether an integer is divisible by 3? Does that have a binary counterpart? Well, let’s think about how this trick works. It’s basically a consequence of the fact that 10 is congruent to 1 modulo 3. If we have an integer x whose decimal expansion is made up of the digits d0, d1, …and dn (in that order, from least significant to most significant), then we have the equation
$x = d_0 + 10 d_1 + \dotsb + 10^n d_n = \sum_{k = 0}^n 10^k d_k.$

Now, if we only care about congruence modulo 3, we can replace the terms in the sum on the right-hand side with terms that are congruent modulo 3 to the original terms. We can also replace factors within those terms by factors that are congruent modulo 3 to the original factors. In particular, we can replace the 10s by 1s. Since 1 is the multiplicative identity, this allows us to eliminate the factors of 10. Therefore, we have the congruence
$x \equiv \sum_{k = 0}^n d_k \pmod 3.$

That is, the value of x modulo 3 is the same as the value modulo 3 of the sum of the digits in the decimal expansion of x. This proves that the trick works, because x being divisible by 3 is equivalent to x being congruent to 0 modulo 3. In fact it gives us two more tricks: an integer is 1 plus a multiple of 3 if and only if the sum of its digits is also 1 plus a multiple of 3, and an integer is 1 minus a multiple of 3 if and only if the sum of its digits is also 1 minus a multiple of 3.

Now, if d0, d1, …and dn are binary bits rather than decimal digits, then we must start with the equation
$x = d_0 + 2 d_1 + \dotsb + 2_n d^n = \sum_{k = 0}^n 2^k d_k,$
with the digits being multiplied by powers of 2 rather than powers of 10. The number 2 is congruent to −1, not 1, modulo 3. But this still allows us to do some simplification, since ( − 1)k is 1 for even integers k and −1 for odd integers k. The congruence simplifies to
$x \equiv \sum_{k = 0}^n (-1)^k d_k \pmod 3,$
showing that x is congruent modulo 3 to the alternating sum of its bits.

A computer could calculate this alternating sum by simply iterating over the bits of x, alternating between an adding and subtracting state. However, this wouldn’t be very efficient; it would require as many iterations as x has bits, which means it would likely be no quicker than simply using the division algorithm a lot of the time.

It is possible to do better. Rather than taking each bit as a unit, we can take each two-bit segment as a unit. This will eliminate the need to explicitly alternate between addition and subtraction. So, assuming n is odd (so that x has an even number of bits, including the one with place value 1), we may write
$x \equiv \sum_{k = 0}^{\frac {n - 1} 2} (d_{2k} - d_{2k + 1}) \pmod 3.$

Now here’s something neat: −1 is congruent to 2 modulo 3! So we can also write this congruence as
$x \equiv \sum_{k = 0}^{\frac {n - 1} 2} (d_{2k} + 2d_{2k + 1}) \pmod 3.$
Now, for every integer k between 0 and $\frac {n - 1} 2$ inclusive, the sum d2k + 2d2k + 1 is just the value of the two-bit integer whose bits are d2k (least significant) and d2k + 1 (most significant). So we can state the binary rule for divisibility by 3 as follows:

A binary expansion has a value divisible by 3 if and only if the sum of the values of its two-bit segments, interpreted as independent binary expansions, is divisible by 3. More generally, its value is congruent to this sum modulo 3.

This rule can be straightforwardly applied in a computer program. It’s just a matter of summing segments. Once we have the sum, we can use a lookup table to determine its value modulo 3, since the set of possible values of the sum will be much smaller than the set of possible values of the original integer. The summing of the segments can be done in parallel using bitwise operations in an unrolled loop. Here’s an implementation in the C programming language.

unsigned mod3(unsigned x) {
static unsigned TABLE[48] = {
0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2,
0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2,
0, 1, 2, 0, 1, 2
};

/* Sum adjacent 2-bit segments in parallel. Note that 0x33333333 is
8 repetitions of the bit sequence 0011, and 0xcccccccc is 8
repetitions of the bit sequence 1100. */
x = (x & 0x33333333) + ((x & 0xcccccccc) >> 2);

/* Sum adjacent 4-bit segments in parallel. Note that 0x0f0f0f0f is
4 repetitions of the bit sequence 00001111, and 0xf0f0f0f0 is 4
repetitions of the bit sequence 11110000. */
x = (x & 0x0f0f0f0f) + ((x & 0xf0f0f0f0) >> 4);

/* Sum adjacent 8-bit segments in parallel. Note that 0xff00ff00 is
2 repetitions of the bit sequence 0000000011111111, and 0xf0f0f0f0
is 2 repetitions of the bit sequence 1111111100000000. */
x = (x & 0xff00ff00) + ((x & 0x00ff00ff) >> 8);

/* Sum the two 16-bit segments. */
x = (x & 0x0000ffff) + ((x & 0xffff0000) >> 16);

return TABLE[x];
}

Unfortunately, this implementation does not appear to actually be more efficient than a regular modulo operation. I wrote a profiler for the routine (the source code is available on GitHub at https://github.com/Andrew-Foote/odds-and-ends/blob/master/mod3.c—note that it is not written to be portable) and ran it on Windows using Microsoft’s Visual C++ compiler. I also profiled the calculation of values modulo 3 using the ordinary modulo operator for comparison. 16777216 calls to the my subroutine took about 200 milliseconds, but 16777216 ordinary modulo operations took only about 150 milliseconds.

Of course, it may be that the method could be more efficient than a regular modulo operation if my code was better. I’m not very experienced with this kind of programming.

# Another trick

Although our trick of summing the 2-bit segments didn’t pay off, we can find a trick that does pay off by simply taking a C program that computes values modulo 3 in the obvious way, using the modulo operator, and compiling this program with an optimizing compiler. An optimizing compiler will optimize whatever it can, so if there is a trick that can be used to calculate values modulo 3 more efficiently, the compiler should make use of it.

To see the assembly output from a compiler, there’s no need to actually run a local compiler: Matt Godbolt’s Compiler Explorer tool at https://godbolt.org has got you covered. It’s a very neat website that lets you choose from an array of different compilers for different languages hosted on its own servers, so that you can quickly compare outputs.

Here’s the code for a C function which does an ordinary modulo operation. It deals with unsigned (non-negative) integers only, to keep things maximally simple.

unsigned mod3(unsigned x) {
return x % 3;
}

Compiling with the GNU C Compiler (GCC), version 8.3, on an x86-64 architecture and using the -O3 flag (for maximal standards-compliant optimization), we get this assembly output:

mod3:
mov eax, edi
mov edx, -1431655765
mul edx
mov eax, edx
shr eax
lea eax, [rax+rax*2]
sub edi, eax
mov eax, edi
ret

This clearly isn’t just doing a div. So what’s going on? In case you can’t read x86-64 assembly, the assembly subroutine is effectively using the following formula2 to compute the value modulo 3 of the argument x:
$x \bmod 3 = x - 3 \left \lfloor \frac {2863311531x} {2^{33}} \right \rfloor.$

In general, the remainder of an integer a on division by another integer b can be calculated by subtracting the quotient yielded by the same division from a. So the assembly code is really calculating $\lfloor \frac x 3 \rfloor$ first, and then calculating the remainder using this value. The interesting part is the way in which it computes $\lfloor \frac x 3 \rfloor$. Apparently, for nonnegative integers less than 232, we have
$\left \lfloor \frac x 3 \right \rfloor = \left \lfloor \frac {2863311531x} {2^{33}} \right \rfloor. \quad (1)$

Indeed, if you replace the modulo operation in our mod3 function with an integer division operation (and rename the function with the more appropriate name of quo3) you’ll see the assembly output below in the Compiler Explorer:

quo3:
mov eax, edi
mov edx, -1431655765
mul edx
mov eax, edx
shr eax
ret

This is just equation (1) in x86 assembly language.

So, why does this equation hold? Well, let’s have a look at this mysterious constant 2863311531 that turns up in it. Often, when computers appear to be using a mysterious constant, things make more sense when you look at the binary expansion of the constant. The binary expansion of 2863311531 is this:
10101010101010101010101010101011.
Aha! It’s just 15 repetitions of the two-bit sequence 10, with an extra two 1 bits on the end. Another way to put it is that it’s m + 1 where m is the integer whose binary expansion is 16 repetitions of the two-bit segment 10.

What can we do with this knowledge? Well, a repeating sequence of a number of bits or digits is nothing more than a geometric series. Let’s write m as a geometric series:
$m = \sum_{k = 0}^{15} 2^{2k + 1} = \sum_{k = 0}^{15} 2 \cdot 2^{2k} = 2 \sum_{k = 0}^{15} 2 \cdot 4^k.$
This geometric series has initial value 2, common ratio 4 and 16 terms. Therefore, it can be evaluated as the fraction
$2 \frac {4^{16} - 1} {4 - 1} = 2 \frac {2^{32} - 1} 3 = \frac {2^{33} - 2} 3.$
Now a divisor of 3 has turned up, which is promising.

In the formula, we actually multiply x by m + 1, not m itself. If we add 1 to the fraction above, we get
$\frac {2^{33} + 1} 3.$
Multiplying by 2−33x, this comes out as
$\frac {x + 2^{-33} x} 3 = \frac x 3 + \frac {2^{-33} x} 3.$
Now, since x < 232, we have
$\frac {2^{-33} x} 3 < \frac 1 6.$
Since $\frac x 3$ is an integer divided by 3, it is impossible for its floor to change when you add a real number less than 1/3 to it. Since $\frac {2^{-33} x} 3 < \frac 1 6$, this proves equation (1). QED.

This technique readily generalizes to even word sizes other than 32, by the way. If the word size is an arbitrary even integer n, then to compute $\lfloor \frac x n \rfloor$ we just have to calculate
$\left \lfloor 2^{-(n + 1)} x \left( 1 + \sum_{k = 0}^{n/2 - 1} 2^{2k + 1} \right) \right \rfloor.$

What about moduli other than 3? If you play around with the Compiler Explorer, you’ll see that GCC uses roughly the same sequence of operations to calculate values modulo any constant, so there is a general trick at work here. I may try to reverse-engineer it in another post soon (or I may not; I don’t have a great track record of completing planned sequences of posts on this blog 🙂 )

1. Absolutes are rarely true, of course. All else being equal, it’s better for a computer to have a longer battery life, and this is facilitated by it not having to carry out too many complex operations. But this isn’t a critical concern, compared to things like the capacity of the computer to do what the user wants in a reasonable amount of time, considering that its battery can always be recharged. Likewise, there is a small chance of a freak mechanical failure with every operation, so the more operations are done, the more likely such errors are; however, the base chance is still so low that this hardly ever becomes a matter of concern to users.

2. Careful readers might wonder whether this formula is oversimplified, since the multiplication of $\lfloor \frac {2863311531x} {2^{33}} \rfloor$ by 3 might result in overflow, in which case it would be $(3 \lfloor \frac {2863311531x} {2^{33}} \rfloor) \bmod {2^{32}}$ that would be subtracted from x, not the full product $3 \lfloor \frac {2863311531x} {2^{33}} \rfloor$. However, this multiplication will actually never overflow. You can convince yourself of this by considering the case where x is as large as possible, i.e. x = 232 − 1.

## The difference between key-value containers and functions

I was wondering why all of the programming languages I know treat key-value containers (i.e. sequences and associative arrays) differently from functions. From mathematics, I’m used to thinking of key-value containers as functions of one argument, taking a key as input and giving a value as output. Treating key-value containers as functions would have some nice effects, such as allowing the free use of infinite containers and allowing the use of function composition in place of the higher-order function normally called map.

Here’s my guess: one major difference between key-value containers and functions in most programming languages is that the values in containers are precomputed, while the values of functions are computed only when the function is called on the argument. This can be illustrated if we try using function composition in place of map in Racket:

(define (my-map f xs)
(lambda (i)
(f (xs i))))

(define (my-sequence i)
(match i
(0 1)
(1 4)
(2 9)))

(define (write-and-square x)
(writeln x)
(* x x))

(writeln "squared-list:")

; displays 1, 4 and 9
(define squared-list (map write-and-square (list 1 4 9)))

(writeln "my-sequence-squared created:")

; nothing displayed
(define my-sequence-squared (my-map write-and-square my-sequence))

(writeln "my-sequence-squared accessed:")

(writeln (my-sequence-squared 0)) ; displays 1 and 1
(writeln (my-sequence-squared 1)) ; displays 4 and 16
(writeln (my-sequence-squared 2)) ; displays 9 and 81

When we use Racket’s map function to define squared-list, the members of the list have write-and-square applied right away, and we see the side-effect of the function. But when we use our my-map function, the evaluation of write-and-square on a given key is delayed until the value the key maps to is actually requested. This is a sort of lazy evaluation. Although lazy evaluation has its uses, it can lead to inefficiency if a key is accessed repeatedly and the value it maps to is fully recomputed again and again needlessly without a cache being stored in memory.

To deal with this, we can use memoization. However, since an arbitrary function may have an infinite domain, we have to choose which values to memoize. Here’s a Racket function for carrying out the memoization (using some eval black magic to dynamically write a pattern match):

(define (memoize mapping keys)
(define cases (map (lambda (key)
(list key (mapping key))) keys))

(lambda (key)
(eval (append
(list 'match key)
cases
(list (list '_ (list mapping key)))))))

And here’s a test showing the function has the expected behaviour.

(writeln "my-sequence-squared-2 created:")

; displays nothing
(define my-sequence-squared-2 (my-map write-and-square my-sequence))

(writeln "my-sequence-squared-3 created:")

; displays 1 and 4
(define my-sequence-squared-3 (memoize my-sequence-squared-2 (list 0 1)))

(writeln "my-sequence-squared-3 accessed:")

(my-sequence-squared-3 1) ; displays 16
(my-sequence-squared-3 2) ; displays 9 and 81

Now, for functions with finite domain, a natural default is to memoize all of the values. But how do we do that? Given an arbitrary function, we don’t know whether its domain is finite, and even if its domain is finite, we don’t know how to iterate through all of the domain’s members. So we can’t just add some extra code in my-map saying something like “if the function has a finite domain, memoize each of the values of the composition”. Our options are to either require users of my-map to explicitly memoize whatever specific values need memoizing, or add some extra information to the functions themselves specifying whether they have a finite domain and how they can be iterated over, which my-map can then take advantage of.

A natural way to supply this information is to use objects of a distinct type, a “container type”, instead of functions on finite domains. Any object with a container type can be assumed to be finite. Furthermore, we can have multiple container types, each associated with their own iteration protocol.

This, then, is the purpose, or at least a purpose, of having containers as a distinct concept from functions. Containers are functions together with the necessary additional information required to memoize all of their values behind the scenes whenever they are composed.

Source code for this blog post (note: the last couple of lines give an error when the file is run in DrRacket, but if you type them into the REPL after running the rest of the program they work fine, so I assume the problem is DrRacket’s and not mine 🙂 )

## Reversing a linked list in place

The other day I finally realized how you reverse a linked list in place.

I had attempted to tackle this problem several times before, and had always ended up looking up the answer online, copying and pasting some code given in the answer and seeing that it worked, but not really understanding how it worked, as evidenced by my failing to remember how to do it the next time. Now, however, I’m confident that I can remember it for the long term.

It’s remarkable how unhelpful the online resources were here. Consider this description of the algorithm from GeeksforGeeks.com, which was the first result Google gave me for “reversing a linked list” as I was writing this post:

1. Initialize three pointers prev as NULL, curr as head and next as NULL.

2. Iterate trough the linked list. In loop, do following.
// Before changing next of current,
// store next node
next = curr->next

// Now change next of current
// This is where actual reversing happens
curr->next = prev

// Move prev and curr one step forward
prev = curr
curr = next

This is a particularly bad explanation—it’s not actually complete, and even the spelling and grammar is poor—but in terms of how it describes the solution it’s not substantially different from the others I saw. It just says you have to use three pointers and do a particular sequence of reassignments until one of them is null. There is little explanation as to why this particular sequence of reassignments works. For the reader, it’s just a magic sequence that has to be memorized.

In order to understand anything complex, you have to use abstractions, to break the complexity into manageable chunks, and analogies, to allow you to apply your existing knowledge about other things. For me, the key to understanding how to reverse a linked list was realizing that what you do to the list in the loop above can be understood as simply carrying out the two operations below in succession:

1. Remove the first item from the list.

2. Attach this item to the front of a new list.

The processes of attachment and removal can be readily visualised, so they allow us to draw on our spatial intuition to reason about data structures. It’s obvious that if we keep removing items from the original list and attaching them to the new list until there are no items left to remove from the original list, then the new linked list will end up having the same items the original list originally had, but in reverse order. The new list can then be assigned to the variable holding the original list.

It’s really an incredibly straightforward algorithm. I think the reason it took me so long to figure it out was that I was disregarding all ideas for solutions that on a conceptual level involved moving items into a new list and then reassigning the variable. This was a case where spatial intuition was leading me astray: such solutions are conceptually not “in-place”, so I assumed any such solution would have be actually not in-place; an amount of space proportional to the size of the list would need to be used in order to store the new list. Of course, I was forgetting that you can remove items from the original list as you attach them to the new list, so that no additional space needs to be used that wasn’t already being used to store the original list.

Here’s the code I ended up writing (in C):

#include <stdlib.h>

struct node {
int value;
struct node *next;
};

void list_push(struct node **list, struct node *node) {
node->next = *list;
*list = node;
}

struct node *list_pop(struct node **list) {
struct node *node = *list;
*list = node->next;
return node;
}

void list_reverse(struct node **list) {
struct node *new_list = NULL;

while (*list)
list_push(&new_list, list_pop(list));

*list = new_list;
}

Note that if you inline the list_push and list_pop functions in list_reverse, you get what is essentially the loop with three pointers that all the online resources were talking about:

void list_reverse(struct list **list) {
struct list *new_list = NULL;

while (*list) {
struct list *node = *list;
*list = node->next;
node->next = new_list;
new_list = node;
}

*list = new_list;
}

Only the variable names are different: list, node and new_list correspond to the variables next, curr and prev respectively in the GeeksForGeeks.com description.

There are of course different ways of conceptualizing what happens during the loop. The conceptualization described above, where we move the items into a new list by attaching to the front, is just the one that came most naturally to me. From the variable names the writer of that GeeksForGeeks.com article used, it’s evident that their conceptualization was different—prev doesn’t make a lot of sense as a name for a new container we’re placing items into.

Now that I look at it again, I see that the GeeksforGeeks.com article includes an animation which illustrates the conceptualization they were probably using. You can think of the linked list as a bunch of items connected by arrows, where the arrows represent pointers. To reverse the list, you just flip the arrows!

Once you conceptualize the problem in this way, I think it’s fairly straightforward to work out how to do it. You have to iterate over the list, keeping both the current node and the previous node in memory so that you can point the current node to the previous one. The previous one either will have been already set to point to one before it, or is the first node in the list, in which case it needs to be pointed to NULL. We’ll also need make temporary copies of what the current node was originally pointing to on each iteration, so we can move on to the next node even after we change what the current node is pointing to. So with this conceptualization, we can easily see that we need three pointers, and that prev, curr and next are appropriate names for them.

void list_reverse(struct list **list) {
struct list *prev = NULL;
struct list *curr = *list;

while (curr) {
struct list *next = curr->next;
curr->next = prev;
prev = curr;
curr = next;
}
}

This makes it more clear where the GeeksforGeeks.com article was coming from. To its writer, the sequence of assignments in the loop probably seemed like a simple atomic operation—“make the current item point to the previous one”. But the writer failed to explicitly point out that they were making use of this conceptualization, and the clarity of their explanation for a beginner suffered because of it.

It’s likely that they weren’t particularly aware that they were relying on this particular conceptualization, or that there were alternatives. I think this is a general problem with trying to explain things to other people. People tend to stick with the first conceptualization of a problem that proves helpful for them, and think only in those terms. Even if they are aware of alternative conceptualizations, they may think of those conceptualizations as misunderstandings of the problem, when it’s really just a case of different conceptualizations having their own advantages and disadvantages and suiting different thinking styles. Even if some conceptualizations are less helpful than others, it may be best to allow a beginner to work with the conceptualization they find most natural as a starting point.

Anyway, all this has got me curious—if there is anybody still reading who has ever tried to reverse a linked list, how do you prefer to conceptualize the linked list reversal problem? Moving items, or flipping pointers, or something else? Leave a comment—I’d be interested to hear your answer.