# Tag Archives: logic

## An introduction to Turing machines

The story of how Alan Turing made important contributions to the allied war effort in World War II as part of a team cracking German codes, only to later be prosecuted for homosexual activity and chemically castrated as punishment, is well-known. If you didn’t already know about it, you might have seen the film about it, The Imitation Game, which came out recently. But this story, while worth knowing about, is not the whole story of Turing’s life: he was also a rather important mathematician. In fact, he is considered one of the founders of computer science. His principal contribution was the concept of the Turing machine, and in this post, I’ll try to explain what a Turing machine is and why it matters, so that you can learn something about what will, in the end, probably be his most enduring claim to fame.

Essentially, a Turing machine is an abstract model of a computer, which we can use to make mathematical proofs about what computers can do and what they cannot—or, in other words, what is computable and what is not computable. It’s important to bear in mind that when Turing came up with the concept, computers as we know them today, these complex machines with CPUs and memory, did not exist. In Turing’s time, the word ‘computer’ was mainly used to refer to people. It was an agent noun, formed from the verb ‘compute’. The prototypical kind of computation, to somebody in Turing’s time, would have been arithmetical calculation: adding two numbers, multiplying two numbers, that sort of thing. This sort of computation, of course, can be done in your head, or, failing that, with the aid of a pencil and some paper. No machines need be involved, although they might be able to do it a lot faster. The point is that computation is a pretty fundamental concept that exists independently of the machines that we call computers. In fact, it is closely related to the concept of algorithms: computation can be thought of as the carrying out of algorithms. And, of course, algorithms have been an important concept in mathematics since it began—the ancient Babylonians knew and used algorithms to solve linear and quadratic equations.

The problem Turing was trying to solve was this: how can computation be defined precisely? It is, of course, fairly easy to give a non-precise definition of the concept. To compute is to carry out an algorithm. And algorithms can be characterised by the following properties, paraphrased from the Stanford Encyclopedia of Philosophy. An algorithm is a procedure for obtaining a result such that

• the algorithm consists of a finite number of instructions, which are precise and unambiguous,
• following the instructions results in the production of the result after a finite number of steps (the algorithm does not run forever),
• the instructions can be followed successfully by a human without any special equipment, except paper and a pencil to aid the memory,
• and, provided the instructions are understood and followed exactly, success is guaranteed; no ingenuity is required of the human.

(The term that tends to be used, rather than ‘algorithm’, in this area is ‘effective procedure’, but it means the same thing. I’ll stick with ‘algorithm’ in this post, since it’s probably more familiar.)

There is nothing wrong with non-precise definitions; most concepts do not have precise definitions, and indeed most concepts probably cannot be given precise definitions. Before Turing, mathematicians had mostly managed just fine without a precise definition of computation. But Turing had a pretty good reason for wanting a more precise definition. In 1928, David Hilbert posed a problem known as the Entscheidungsproblem, or, in English, the ‘decision problem’. This problem asked whether there was an algorithm that could, basically, solve mathematics. More specifically, it asked whether there was an algorithm that could show whether an arbitrary sentence of first-order logic was true or false. First-order logic is a precisely defined language in which virtually every mathematical statement can be expressed, including conjectures like Goldbach’s conjecture (that every even integer greater than 2 is the sum of two prime numbers) that mathematicians have been trying to prove for centuries. If there was an algorithm that could show whether an arbitrary sentence of first-order logic was true or false, we could simply run the algorithm on Goldbach’s conjecture and this centuries-old problem would be solved. (It might be a bit of an exaggeration to say that mathematics would be completely solved, though, because running the algorithm might be practically infeasible, if it required too much computing power. Also, there might be interesting questions that are not expressible in first-order logic. Still, it can’t be denied that the solution of the Entscheidungsproblem would have an enormous impact on our conception of mathematics.) But in order to give a proof that such an algorithm could exist or could not exist and thus solve the Entscheidungsproblem, it was necessary first to give a mathematical definition of computation.
Otherwise, anyone who claimed to have found such an algorithm could be countered with the accusation that their algorithm wasn’t a real algorithm, or anyone who claimed to have proven that such an algorithm was impossible could be countered with the accusation that they hadn’t considered every possible algorithm.

But now you might be wondering: wouldn’t any precise definition of computation suffer from the same problem? Any proof based on the definition could be countered with the accusation that you weren’t using the right definition. If, for a given definition, there was an algorithm that could decide whether an arbitrary mathematical statement was true or false, people could say that the definition was too inclusive. And if it could be proven that there was no such algorithm, people could say that the definition was too exclusive. Admittedly, the first possibility was less serious, because, presumably, if there was such an algorithm, we could just see how it works, and it would be fairly easy to see, on an intuitive basis, whether it really was an algorithm, by checking that it required no ingenuity, and so on. As of today, however, one thing is certain: nobody has managed to find such an algorithm. So we have to consider the second possibility, too. But, since our intuitive concept of algorithms is somewhat vague, how can we ever be sure that our definition doesn’t exclude some algorithms that we’d intuitively recognise as algorithms? This problem seems especially insurmountable given that algorithms can be enormously complicated; indeed, they can be complicated to an arbitrary degree, and we don’t have a lot of intuition as to the upper limits of what algorithms are capable of.

It is certainly impossible to ever be sure in the mathematical sense that a given definition of computation is the correct one. But there is some pretty convincing evidence that Turing’s definition is the correct one. There were actually quite a few definitions of computation proposed during the 1930s, along with Turing’s definition via Turing machines. Principal among these were Alonzo Church’s definition via lambda calculus and Kurt Gödel and Jacques Herbrand’s definition via recursive functions. I won’t go into these alternative definitions of computation here, but if you look them up yourself, you’ll see that they were quite different in nature from each other and from Turing’s definition (which I’m going to get to shortly). Yet it turned out, conveniently, that all of these definitions were equivalent: the procedures defined as algorithms by Turing’s definition were exactly those defined as algorithms by Church’s definition and exactly those defined as algorithms by Gödel and Herbrand’s definition. And this is quite a coincidence. Consider the collection of all algorithms, A. Each definition picks out a certain sub-collection A′ of A (i.e. a collection which consists of algorithms, but does not necessarily contain every algorithm), and only the algorithms in this sub-collection are considered algorithms by that definition. The fact that the definitions are all equivalent means that each one picks out exactly the same sub-collection. Why would this be the case? One explanation would be that each definition is, in fact, picking out the biggest sub-collection possible—the collection A itself. That’s why the vast majority of mathematicians today believe that Turing’s definition (and Church’s, and Gödel and Herbrand’s, since they coincide with Turing’s definition) is the correct one. 
This belief has a name: it’s known as Church’s thesis (or the Church-Turing thesis; I suppose calling it the Church-Turing thesis would be more in accordance with the theme of this post, but ‘Church’s thesis’ is less verbose). It is not, by its nature, definitively provable to the standards mathematicians generally prefer, but I don’t think anybody seriously doubts it to be true. After all, people have come up with lots of other definitions of computation since the 1930s, and they have all ended up coinciding with the original definitions of Gödel, Herbrand, Church and Turing. And nobody has ever managed to think of an algorithm that does not satisfy these definitions. It is worth noting that while Church’s thesis cannot be definitively proven, it could be definitively disproven, if there did turn out to be an algorithm that was not considered an algorithm by the definitions. This is part of the reason why we believe it: it’s a falsifiable belief, and yet nobody has managed to falsify it.

Let’s see what Turing’s definition was, then. The idea Turing had was to describe a simple kind of machine for carrying out algorithms. He didn’t call these machines ‘Turing machines’ himself (that would be kind of egotistical), but that’s what they have become known as. An algorithm could then be defined as any procedure that could be carried out by such a machine. Again, it’s interesting to note that for Turing, it was the idea of someone working out a calculation with pencil and paper that was foremost in his mind when he thought about computation. Paper is two-dimensional, so, to simplify matters, he said that the machine would make its calculations on a one-dimensional strip of paper, a ‘tape’, which would be divided into squares. There would be infinitely many squares; the machine would not be limited by constraints of space. (After all, a person carrying out a calculation can always find more paper.) Each square would contain space for a single symbol to be written, chosen from a finite collection of available symbols. The square could of course also be blank, but we can consider that to be just another symbol. The machine would ‘look’ at a single square at a time and its action would be determined by the symbol on that square. However, the machine could be placed in a finite number of different states, so it would not always do the same thing upon seeing a given symbol; its action would also depend on its state. The possible actions would be the following:

1. Write a new symbol on the square currently being looked at (this may be the blank symbol, so that the old symbol is erased).
2. Move the tape one square to the left, so that the machine ends up looking at the square directly to the right of the one that was originally being looked at. This is called ‘shifting right’.
3. Move the tape one square to the right, so that the machine ends up looking at the square directly to the left of the one that was originally being looked at. This is called ‘shifting left’.
4. Stop the computation. This is called ‘halting’.

After performing one of these actions, the machine would also possibly change to a different state (if it did not halt). In more abstract terms, the machine would be completely specified by its possible states, its possible symbols, and an association of each state q and each symbol S with an action and a state, corresponding to the action performed, and the new state entered (which may be the same state) when the machine is looking at S while in state q. If the action is action 4 (the halting action), it is pointless to specify a new state to enter, so we can adopt the convention that if there is no association of one of the state-symbol pairs with an action and a new state, then the machine halts if it looks at that symbol while in that state. The machine would have to halt at some point in order to be useful. If the machine did halt, the sequence of symbols remaining on the tape afterwards would be considered the result of the computation. This result may depend on the initial sequence of symbols that was on the tape before the computation was started. So the machine could be thought of as computing a function, which takes as input the initial sequence of symbols, and returns as output the final sequence of symbols. (If the machine does not halt for some input, the function is undefined for that input.) The sequences of symbols could be interpreted any way you like. For example, if the symbols are the blank symbol and 1, you could interpret a sequence of symbols as a sequence of numbers by interpreting each sub-sequence of consecutive 1s as the number of 1s in the sub-sequence, with the blank symbols functioning as separators. So, for example, 11 111 would represent (2, 3), 1111 would represent (4), 111 11 1 would represent (3, 2, 1).
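The abstract specification above translates into code quite directly. Here is a minimal sketch in Python; the state names, and the choice of the character '0' to stand for the blank symbol, are my own choices for illustration, not part of the definition:

```python
BLANK = '0'  # using '0' to stand for the blank symbol

def run(table, tape, state, pos=0, max_steps=10_000):
    """Run a Turing machine given as a table mapping (state, symbol)
    pairs to (action, new_state) pairs. The tape is a dict from square
    positions to symbols; absent squares count as blank. Actions are
    'L' (shift left), 'R' (shift right), or a symbol to write (so this
    sketch assumes 'L' and 'R' are not themselves tape symbols)."""
    for _ in range(max_steps):
        symbol = tape.get(pos, BLANK)
        if (state, symbol) not in table:
            return tape           # halting convention: no entry means halt
        action, state = table[(state, symbol)]
        if action == 'L':
            pos -= 1              # shift left
        elif action == 'R':
            pos += 1              # shift right
        else:
            tape[pos] = action    # write the symbol `action`
    raise RuntimeError('ran out of steps without halting')

# A machine that adds 1 to a number in the unary encoding just described:
# shift left past the 1s, then write a 1 on the first blank square.
add_one = {
    ('a', '1'): ('L', 'a'),    # keep moving left over the 1s
    ('a', BLANK): ('1', 'b'),  # found the blank: write a 1
}                              # state 'b' has no entries, so it then halts

tape = run(add_one, {0: '1', 1: '1', 2: '1'}, 'a')        # input: 3
print(''.join(tape.get(i, BLANK) for i in range(-1, 3)))  # -> 1111
```

Note how halting falls out of the convention adopted above: a missing (state, symbol) entry means ‘stop’, so the machine halts as soon as it enters state 'b'.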

It might not seem obvious that such a simple machine really can carry out any computation. So, I’ll show you a few examples of Turing machines that can carry out calculations with numbers, using the interpretation of sequences of blank symbols and 1s as sequences of numbers given above. Here’s a very simple example: a Turing machine that adds 1 to the input (which should be a single number). It’s represented as a flow chart. To find out what the machine will do in the state q while looking at the symbol S, look at the node which is labelled q, and then see if there is an arrow with a label of the form S : A coming out of this node. If there is such an arrow, this means the machine will carry out action A, and then change state to state p, where p is the label of the node the arrow points to. If there is no such arrow, this means the machine will halt. By the way, I’m using 0 here to represent the blank symbol.

(sorry about the crappy hand-drawn diagram, but I couldn’t find any good software for drawing these)

In order for this machine to work, the input must only contain one string of consecutive 1s, and the machine must start by looking at one of these 1s. Then the machine will shift left until it ends up looking at the first 0 to the left of this string of consecutive ones. It will then change this 0 to a 1, and then halt. So the string of consecutive 1s has an extra 1 added to it.

Here’s a more complicated example, building on this one: a Turing machine that adds the two numbers in the input.

(again, sorry about the crappy lighting, but I don’t have a scanner)

This one is probably too complicated to understand just by looking at the diagram. The idea here is that the machine will start by looking at one of the 1s in the first number in the input (i.e. the leftmost sequence of consecutive 1s). Then it will shift left, past this sequence of consecutive 1s, and further past the sequence of 0s that separates the two numbers (this requires a change of state), until it reaches the leftmost 1 in the second sequence of consecutive 1s. Then it changes this 1 to a 0, and—here’s the clever part—it shifts to look at the square immediately to the right. If this square is a 1, the machine shifts left until it reaches the first 0 to the left of the first sequence of consecutive 1s. Then it changes this 0 to a 1. Effectively, the 1 that was in the second sequence gets transferred to the first sequence. Then the machine goes back to state 1, and the process repeats. But at some point, it will change the final 1 in the second sequence to a 0, and then, when it shifts right, it will encounter a 0 rather than 1. This tells the machine that it has reached the end of the second sequence and, therefore, once it’s transferred this final 1 to the first sequence, it can halt (it does this by transferring to state 3, since state 3 has no action specified for the symbol 1).

A nice way to see how this works is to use this online Turing machine simulator. Copy and paste the following code into the text box:

```
0 1 1 r 0
0 _ _ r 1
1 _ _ r 1
1 1 _ * 2
2 _ _ r 3
3 1 1 l 4
4 _ _ l 4
4 1 1 l 5
5 1 1 l 5
5 _ 1 * 0
3 _ _ l 6
6 _ _ l 6
6 1 1 l 7
7 1 1 l 7
7 _ 1 * halt
```


and type an input, such as 1111 111, into the text box labelled ‘Initial input’. Then click the button labelled ‘Run’. You’ll see the machine going back and forth as it transfers the 1s from right to left, one by one. If you want, you can play around with the simulator and try to do other things, such as multiplying two numbers, to convince yourself further that Turing machines can carry out any computation. I can understand if you don’t want to, though, because, as you have probably gathered by now, programming a Turing machine is a very tedious and rather tricky task. Programming a computer can be thought of as a communication problem: you have to tell the machine what to do, but the machine only understands a very peculiar language which humans are not accustomed to thinking in. In practice, people program computers using programming languages such as Java or C++, which are closer to natural language (although still notoriously tricky to communicate to the computer with). Programs written in programming languages are translated into machine language by programs called compilers, so that the computer can understand them.
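If you’d rather not use the web page, an interpreter for the same program format takes only a few lines of Python. This is a sketch under the assumption that each program line has the five fields state, symbol read, symbol written, move and new state, with l and r for the shifts, * for ‘stay’, _ for the blank symbol, and halt as a state with no rules, which is how the code above is written:

```python
# The addition program from above, verbatim.
PROGRAM = """
0 1 1 r 0
0 _ _ r 1
1 _ _ r 1
1 1 _ * 2
2 _ _ r 3
3 1 1 l 4
4 _ _ l 4
4 1 1 l 5
5 1 1 l 5
5 _ 1 * 0
3 _ _ l 6
6 _ _ l 6
6 1 1 l 7
7 1 1 l 7
7 _ 1 * halt
"""

def parse(src):
    """Turn the five-field program text into a transition table."""
    table = {}
    for line in src.strip().splitlines():
        state, sym, write, move, new = line.split()
        table[(state, sym)] = (write, move, new)
    return table

def run(src, tape_str, state='0', max_steps=100_000):
    """Simulate the program, starting at the leftmost square of the input,
    and return the final tape contents with the outer blanks trimmed."""
    table = parse(src)
    tape = {i: ('_' if c == ' ' else c) for i, c in enumerate(tape_str)}
    pos = 0
    for _ in range(max_steps):
        sym = tape.get(pos, '_')
        if state == 'halt' or (state, sym) not in table:
            break  # no rule for this (state, symbol) pair: halt
        write, move, state = table[(state, sym)]
        tape[pos] = write
        pos += {'l': -1, 'r': 1, '*': 0}[move]
    return ''.join(tape.get(i, '_')
                   for i in range(min(tape), max(tape) + 1)).strip('_')

print(run(PROGRAM, '1111 111'))  # adds 4 + 3 in unary -> 1111111
```

An interpreter like this is, incidentally, exactly the sort of thing meant by a ‘universal’ machine: a single program that takes another machine’s description as part of its input and simulates it.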

Now, let’s use this definition of computation to prove what we were trying to prove; that the Entscheidungsproblem is unsolvable. The proof was given by Turing in a 1936 paper, which is freely available here. For a more in-depth proof, you can look at this paper. At several points here, I’m going to have to appeal to Church’s thesis to assure you that certain things can be done by a Turing machine. But it is possible, at every point where this is done, to give the precise details of the Turing machine that could carry out the computation. It would just be a bit too tedious for me to go into. But Turing did go into the details in his original paper.

First, we will show a certain limitation on the power of Turing machines. Note that the input of a Turing machine can be taken to represent a Turing machine itself. This should be fairly clear: a Turing machine is just a set of states and a set of symbols, together with associations of some of the state-symbol pairs with action-state pairs, and clearly these things can all be specified in a finite list of symbols. So we can use Turing machines to carry out computations that tell us things about Turing machines. In fact, we can have something called a universal Turing machine. This is a Turing machine U such that for every Turing machine T and every sequence of symbols x, if U is given a sequence of symbols as its input that represents T together with x, then the output of U will be the output of T when it is given x as input. And if T does not halt when it is given x as input, U will not halt either. Basically, U is a ‘Turing machine simulator’; it can simulate any Turing machine. How can we be sure that a universal Turing machine exists? Well, Church’s thesis means that a universal Turing machine exists if and only if there is an algorithm for finding the output of an arbitrary Turing machine given an arbitrary input. And it is intuitively clear that there is such an algorithm. The fact that universal Turing machines exist is what enables computers to run arbitrary programs; computers are not just Turing machines, but in fact universal Turing machines.

Here’s an interesting question. Is there a Turing machine H that can tell us whether an arbitrary Turing machine T will halt when it is given an arbitrary sequence x of symbols as its input? Exactly how it tells us is not particularly important, but, for example, it might be that the output of H is the number 1 if T halts given x as its input, and otherwise the output of H is the number 0. It is not obvious that this Turing machine exists. After all, a Turing machine can run for an arbitrarily long time before it halts. No matter how long you run a Turing machine for, there is still a possibility, which you cannot rule out, that it might halt at some point in the future, unless there is some way to predict the machine’s future behaviour from its behaviour up to this point. And, intuitively, it does not seem like such a prediction is possible: the behaviour of a Turing machine can be arbitrarily complex; it might do one thing for the first million steps, then switch to doing something completely different.

That’s the intuitive argument that H cannot exist. Now, let’s prove it. This is the only non-straightforward proof in this post; it’s not particularly complicated, but it is a little mind-bending. It uses an argument known as a diagonal argument, which also turns up in Russell’s paradox and Cantor’s proof of the uncountability of the real numbers. If you have never heard of either of those two things, you might have heard of the liar paradox. This is the paradox that arises from the statement ‘this statement is false’. The truth of this statement is equivalent to its own falsehood; it is self-contradictory. Diagonal arguments are a kind of generalisation of the liar paradox. But let’s see how the argument goes in this particular case.

Every Turing machine T can be represented by a sequence x of symbols. So, we can give this representation of T as the input to T itself. Using H, we can find out whether T halts when it is given this input. We have just described an algorithm for finding out whether an arbitrary Turing machine halts when it is given its own representation as its input. So there is a Turing machine G that can carry out this algorithm. In other words, for every Turing machine T with representation x, when G is given the input x, G does one thing if T halts when it is given the input x, and it does another thing if T does not halt when it is given the input x. The useful thing would be for G to do something like returning 1 in the first case, and 0 in the second case. But let’s suppose G does something slightly different instead: in the first case it does not halt, and in the second case it halts, giving some output (it doesn’t really matter what). So, then, given the input x, G halts if and only if T does not halt when it is given the input x. Can you see the problem here? G, like any other Turing machine, has its own representation x, and we can give x to G as its input. So what happens then? By definition, given its own representation x as input, G halts if and only if G does not halt when it is given the input x. Either way, then, we end up in an absurd situation in which G both halts and does not halt. We have shown that assuming the existence of H leads to a logical absurdity; therefore, H cannot exist.
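The same argument can be written down as a short sketch in Python. Here H is played by a hypothetical function halts, which is precisely the thing the proof shows cannot exist; the stub below just raises an error, since no real implementation is possible:

```python
def halts(program, argument):
    """Hypothetical: return True iff program(argument) eventually halts.
    This is the function H whose existence we are assuming for the sake
    of contradiction; it cannot actually be implemented."""
    raise NotImplementedError('no such function can exist')

def g(program):
    """The diagonal machine G: do the opposite of what H predicts,
    using the program itself (here, the function object) as its input."""
    if halts(program, program):
        while True:    # the program halts on its own input: loop forever
            pass
    else:
        return         # the program does not halt on its own input: halt

# Now ask what g(g) would do. If it halts, then halts(g, g) was True, so
# it entered the infinite loop and did not halt; if it does not halt, then
# halts(g, g) was False, so it returned immediately and did halt. Either
# way we reach a contradiction, so halts cannot exist.
```

Nothing here is specific to Python, of course; the code is just a compact way of writing out the construction of G from H.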

And this is a big problem if we are trying to solve the Entscheidungsproblem. Because statements about Turing machines, including statements of the form ‘the Turing machine T halts when it is given the input x’, can be expressed in first-order logic. So if we had an algorithm that could decide whether an arbitrary sentence of first-order logic was true or false, we could, among other things, decide whether an arbitrary Turing machine halts when it is given an arbitrary input. Since this is impossible, there can be no such algorithm. Thus we have proven that the Entscheidungsproblem is unsolvable. Actually, though, I haven’t been totally rigorous here. I haven’t properly justified my claim that all statements of the form ‘the Turing machine T halts when it is given the input x’ can be expressed in first-order logic. For a start, in order to justify this claim, I would have to specify precisely what I mean by first-order logic, because there are actually lots of variants of first-order logic, and for some of these variants the Entscheidungsproblem can be solved (such variants are said to be decidable). However, every variant of first-order logic with a sufficient amount of expressive power is undecidable; it is only the simplest variants which are decidable. I can’t give proper proofs of these results without going into technical details about how first-order logic works. For now, I will leave you, hopefully, somewhat convinced that the Entscheidungsproblem is unsolvable, but fully convinced that the halting problem (the problem of finding out whether an arbitrary Turing machine halts given an arbitrary input) is unsolvable.

This post is mainly based on parts of the following two books:

• Boolos, G., Burgess, J. P., & Jeffrey, R. C. (2002). Computability and logic. Cambridge University Press.
• Wolf, R. S. (2005). A tour through mathematical logic. Mathematical Association of America.

Also, although I’ve already linked to it above, I’ll mention again that Turing’s original paper is available freely here.

## The conditional operator of formal logic

Most of the operators of formal logic correspond in a fairly straightforward way to words or phrases in English. $\neg$ is “not”, $\wedge$ is “and”, and $\vee$ is “or”. The equivalence operator, $\leftrightarrow$, does not really correspond to a well-known English phrase, but it’s pretty easy to understand its meaning: for every pair of logical formulae $P$ and $Q$, $P \leftrightarrow Q$ is true if $P$ and $Q$ have the same truth value (i.e. they are both true or both false) and false if they have different truth values. In other words, $\leftrightarrow$ expresses equality of truth values.

The exception among these operators is the conditional operator, $\rightarrow$. This is usually thought of as corresponding to the English word “implies”, so for every pair of logical formulae $P$ and $Q$, $P \rightarrow Q$ means “$P$ implies $Q$”, which can also be expressed as “$Q$ if $P$” or “if $P$, then $Q$”. However, this does not tell the full story. It is clear, when you think of the meaning of $\rightarrow$ in this way, that if $P$ is true but $Q$ is false, $P \rightarrow Q$ is false, because if $P \rightarrow Q$ were true, that would mean “if $P$, then $Q$”, so, since $P$ is true, $Q$ would have to be true rather than false. It may also seem reasonable that if $P$ and $Q$ are both true, so is $P \rightarrow Q$, and in fact this is correct. But if you want the meaning of $P \rightarrow Q$ to correspond exactly to “if $P$, then $Q$”, then you’ll have a problem with this: what if $Q$ just happens to be true as well as $P$, and there is no causal link between the two formulae? In that case, the statement “if $P$, then $Q$” would be false, since it would be possible to conceive of a circumstance in which $P$ is true but $Q$ is false. For example, it is true that I am currently alive. And it’s true, at this particular moment, that Barack Obama is the President of the USA. Yet I wouldn’t say it’s true that if I am alive at this particular moment, then Barack Obama is the President of the USA. Indeed, this statement actually seems to be false, since if Mitt Romney had won the 2012 election rather than Obama, that would not necessarily have caused me to die¹.

The problem is that the notion of causality is a key part of the meaning of the English word “if”—basically, “if $P$, then $Q$” means that $P$ being true causes $Q$ to be true. But the operators of formal logic are supposed to be truth-functional: if we know the truth value of $P$, and we know the truth value of $Q$, then we should be able to work out the truth value of $P \rightarrow Q$. If we defined $P \rightarrow Q$ to have exactly the same truth value as “if $P$, then $Q$”, then this would not be the case: if we knew that $P$ and $Q$ were both true, we would still need some further information about the causal link between the two formulae before we could conclude that $P \rightarrow Q$ was true.

Rather than thinking of the meaning of $P \rightarrow Q$ as “if $P$, then $Q$”, I suggest you think of it as “when $P$, then $Q$”. The point of switching to the word “when” is that “when” has the same meaning as “if” except that it lacks the connotation of a causal link. It seems less ridiculous to say that, when I am alive at this particular moment, then Barack Obama is the President of the USA². The scenario in which Mitt Romney won rather than Obama has no relevance to whether this statement is true.

Using this conception of the meaning of $P \rightarrow Q$, we can work out how to determine its truth value. As we already said, if $P$ is true, then $P \rightarrow Q$ has the truth value of $Q$ (it’s true if $Q$ is true, and false if $Q$ is false). What if $P$ is false? Well, then $P \rightarrow Q$ is actually always true, in a vacuous sense. To see why this is the case, it might be best to think about what happens if $P$ is a disjunction of multiple logical formulae $P_1$, $P_2$, … and $P_n$ (so that the meaning of $P$ is “$P_1$, or $P_2$, … or $P_n$”). Then $P \rightarrow Q$ means “when $P_1$, $P_2$, … or $P_n$, then $Q$”. In order to check whether $P \rightarrow Q$ is true, then, you have to check that each of the statements $P_1 \rightarrow Q$, $P_2 \rightarrow Q$, … and $P_n \rightarrow Q$ is true. But if $n$ is 0 (note that the disjunction of an empty list of statements is false, since none of the statements in the list are true), there are no statements to check, so you’re already done: you can conclude straight away that $P \rightarrow Q$ is true.
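Because the conditional is truth-functional, the whole truth table just described fits in one line of code. A quick sketch in Python:

```python
def implies(p, q):
    # P -> Q is false only in the case where P is true and Q is false
    return (not p) or q

for p in (True, False):
    for q in (True, False):
        print(f'{p} -> {q} is {implies(p, q)}')
```

Both rows with p false come out true: these are the vacuously true cases.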

Another way to think about it, which may help you make sense of it, is to think about quantified statements: those of the form “all Xs are Ys”, e.g. “all male worker bees are avid fans of Doctor Who”. For every object $x$, let $P(x)$ be the statement “$x$ is a male worker bee” and let $Q(x)$ be the statement “$x$ is an avid fan of Doctor Who”; then this statement can be expressed in formal logic as the conjunction of all the statements $P(x) \rightarrow Q(x)$ for each object $x$. Is this statement true? Well, are there any male worker bees that aren’t avid fans of Doctor Who? No, there aren’t, because there are no male worker bees, and therefore, all male worker bees really are avid fans of Doctor Who. That means that for every object $x$, $P(x) \rightarrow Q(x)$ is true. In particular, for every object $x$ which is not a male worker bee, so that $P(x)$ is false, $P(x) \rightarrow Q(x)$ is true, regardless of whether $x$ is actually an avid fan of Doctor Who or not.
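For what it’s worth, Python’s built-in all follows exactly the same convention: a universally quantified statement over an empty collection counts as true. A tiny sketch (the collections are, of course, made up for illustration):

```python
male_worker_bees = []                     # there are no male worker bees
avid_doctor_who_fans = {'some person'}    # the contents are irrelevant here

# 'All male worker bees are avid fans of Doctor Who' is vacuously true:
print(all(bee in avid_doctor_who_fans for bee in male_worker_bees))  # -> True
```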

Isn’t formal logic weird?

#### Footnotes

1. ^ Mitt Romney wouldn’t have been that bad a President.
2. ^ If you heard this statement in conversation, it would, however, acquire additional connotations due to Grice’s maxim of quantity. If the speaker merely wanted to say that Barack Obama was the President of the USA, they would just say that, and the fact that they didn’t say that suggests it is not in fact true, so perhaps if the speaker wasn’t alive at this particular moment, Barack Obama wouldn’t be the President of the USA. However, the statement is true in the same sense that it is true for a person with two eyes to say “I have one eye”.

## What are numbers?

Numbers are the central object of study in mathematics. Most humans, and certainly all humans in modern, industrial societies, are familiar with numbers and, indeed, use them regularly in their daily lives. Despite this, non-mathematicians will probably find it difficult to give a proper explanation of what numbers actually are. I don’t mean that people actually misunderstand what numbers are—in fact most people do understand what they are—it’s just that it’s difficult to put this understanding into words. A lot of the concepts that are fundamental to our understanding of the world are like this—easy to understand implicitly, hard to explain explicitly. As an example, think about the concept of ‘colour’. Five-year-old kids can understand this concept. But can you explain it in a way that a five-year-old kid could understand? I don’t think I can.

One problem with explaining basic concepts like this is that explanations necessarily make use of concepts themselves. In order for an explanation of a concept to be a proper explanation it must not make use of the concept itself (for example, it is not very helpful to explain that 2 is the second number, since the explanation of the meaning of ‘second’ relies on knowledge of the meaning of 2). So it is impossible to properly explain a concept without relying on other concepts being understood. In this sense, it is actually impossible to explain a concept completely. This is why if you repeatedly ask someone ‘Why?’, they always end up not being able to give a satisfactory answer. It’s not just that the person you’re asking isn’t smart enough: no matter how intelligent the being you ask is, they will be stumped after a finite number of ‘Why?’s.

The fact that nothing can be completely explained is not quite as disturbing as it may sound: really, it’s just an indication that maybe we should define what it means for an explanation to be complete in a different way, or do away with the concept altogether. It is unhelpful to think of explanations as complete or not in an absolute sense, but an explanation can be complete given a set of concepts if it does not rely on any concepts other than those in the set. The concepts in the set can be referred to as the fundamental concepts used in the explanation. As long as the fundamental concepts are understood, the explanation is perfectly satisfactory.

Now we can state the reason why it’s hard to explain basic concepts: for complex topics, you can usually find an explanation which relies fundamentally only on basic concepts, and these explanations seem more satisfying because you can assume the basic concepts are already understood. But if you’re trying to explain a basic concept, there aren’t many concepts which are even more basic; it’s likely that some of the fundamental concepts in the explanation will not be much more basic than the one you’re trying to explain. That makes it harder not to feel like those concepts need an explanation too.

The reason I say all this is to warn you that you might have this problem with the explanation of what numbers are given in this post. This explanation relies fundamentally on the meaning of the word ‘set’. So whether you’ll be happy with it depends on whether you think your understanding of what a set is relies on your understanding of what a number is. If you think it does (and I don’t think this is a ridiculous position to hold) then this explanation might seem pretty useless!

You probably do understand what a set is, even if you haven’t come across the term in its mathematical sense. The meaning of the word ‘set’ in mathematics is basically the same as the normal English meaning of the noun ‘set’: a set is a group or collection of objects. The objects contained in a set are called its members. The important thing to realise about how mathematicians use the word ‘set’ is that mathematical sets do not have any additional structure to them besides their members. To see what I mean, consider a fairly concrete example of a set: the set of items in a shopping bag. If you wanted to describe the shopping bag, there is a lot more you can say about it than just saying what items it contains. For example, you can describe the nature of the bag itself, or you can describe the arrangement of the items within the bag. But these aspects of the bag are not aspects of the bag as a set. When we think about the bag as a set, all those details are abstracted away, and we are only concerned with which objects are in the set.

If you are familiar with formal logic, it might help if I state the following rule, which gives a condition under which two sets are equal.

$(\forall A) (\forall B) ((\forall x) (x \in A \Leftrightarrow x \in B) \Rightarrow A = B).$

For every object $x$ and every set $A$, $x \in A$ is the statement that $x$ is a member of $A$. Therefore, in words, this rule states that

For every pair of sets $A$ and $B$, $A = B$ if it is the case that for every object $x$, $x$ is a member of $A$ if and only if $x$ is a member of $B$.

It might also be helpful to see that this can be expressed equivalently as

For every pair of distinct sets $A$ and $B$, there is an object $x$ which is in one of $A$ and $B$ but not both.

This is just a more precise way of saying what I was talking about above, about how sets have nothing more to them than their members.
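Incidentally, Python’s built-in sets behave exactly according to this rule, which makes for a handy way to get a feel for it (a toy sketch, nothing deep):

```python
# Python sets are extensional: two sets are equal exactly when they
# have the same members. How the members happen to be listed, and how
# many times, is not part of the set's identity.
A = {1, 2, 3}
B = {3, 1, 2, 2}  # the same members, written differently

print(A == B)  # True: no object is in one set but not the other
```

The duplicated 2 and the different ordering in `B` are exactly the kind of “additional structure” that sets abstract away.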

So, how do sets help with defining what numbers are? Well, first let’s note that it is possible to speak of the number of members a set contains. This number is called the cardinality of the set. For example, there is a set with no members at all (which is called the empty set); that set has cardinality 0. And sets with just a single member have cardinality 1. Note that some sets have infinitely many members, so we cannot assign those sets a cardinality. If a set can be assigned a cardinality, its cardinality will always be a natural number (a non-negative integer such as 0, 1, 2 or 3).

Perhaps, then, we could define the natural numbers in terms of cardinality. We would have to find some way of distinguishing sets that have different cardinalities that does not rely on the notion of number already. This would allow us to classify the sets according to their cardinalities. Then each natural number can be defined as a symbol representing one of the classes. For example, 0 would represent the class of all empty sets, 1 would represent the class of all sets with a single member, and so on.

There is, in fact, quite a simple way of distinguishing sets with different cardinalities. But before I can articulate this I’m going to have to introduce a couple more concepts. The first concept is that of a correspondence from one set to another. A correspondence from a set $A$ to a set $B$ is simply a way of distinguishing certain pairs of objects $a$ and $b$ where $a$ is a member of $A$ and $b$ is a member of $B$. The objects in the distinguished pairs are thought to correspond to each other. Correspondences between finite sets can easily be represented as diagrams: just write down the members of each set in two separate vertical columns and draw lines linking the distinguished pairs. Here are some examples of correspondences, represented by diagrams, between the sets $\{1, 2, 3\}$ and $\{A, B, C\}$ (i.e. the set of the first three positive integers and the set of the first three capital letters; the curly-bracket notation used here is a standard one for writing sets in terms of their members). I’ll call these correspondences $R$, $S$ and $T$, going from left to right.

The last correspondence, $T$, is the kind we’re interested in. By $T$, every member of $\{1, 2, 3\}$ corresponds to exactly one letter, and every member of $\{A, B, C\}$ corresponds to exactly one number. In terms of the diagram, each object is linked to exactly one object in the other set by a line. Correspondences like $T$ are called bijections. $R$ and $S$ are not bijections, because 3 corresponds to no letter by $R$ and 1 corresponds to both A and B by $S$.

It’s no coincidence that $\{1, 2, 3\}$ and $\{A, B, C\}$ both have three members. In fact, there is a bijection between two sets if and only if they have the same number of members. This may be pretty clear to you from the definition of “bijection”. If it isn’t, consider this: when you count the members of a set like $\{A, B, C\}$, you assign 1 to A, 2 to B and 3 to C, and so you are actually showing that the correspondence $T$ exists. Of course you could also count the other way and assign 3 to A, 2 to B and 1 to C; that would still be a perfectly valid way of counting, because the correspondence thus established is still a bijection. But if you were to “count” by establishing the correspondence $S$, i.e. by assigning 1 to A, 1 to B and 2 to C, and then conclude that $\{A, B, C\}$ has only 2 members because you only got up to 2, this would obviously be wrong. The ways of counting which work are exactly those which establish a bijection between $\{1, 2, 3\}$ and $\{A, B, C\}$.
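If it helps, here’s a rough Python sketch of the same idea, representing a correspondence as a set of pairs (this representation, and the `is_bijection` helper, are my own illustration, not standard notation):

```python
def is_bijection(pairs, A, B):
    """Check that a correspondence (a set of (a, b) pairs) links every
    member of A, and every member of B, to exactly one partner."""
    lefts = [a for a, _ in pairs]
    rights = [b for _, b in pairs]
    return sorted(lefts) == sorted(A) and sorted(rights) == sorted(B)

A, B = [1, 2, 3], ["A", "B", "C"]

T = {(1, "A"), (2, "B"), (3, "C")}  # a bijection
S = {(1, "A"), (1, "B"), (2, "C")}  # 1 corresponds to two letters
R = {(1, "A"), (2, "B")}            # 3 corresponds to nothing

print(is_bijection(T, A, B))  # True
print(is_bijection(S, A, B))  # False
print(is_bijection(R, A, B))  # False
```

Note that the check never asks “how many” members either set has; it only compares which objects get used, and how often, which is the whole point of defining cardinality this way.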

This, then, is how we can distinguish sets with different cardinalities. So then we can define each natural number, as explained above, as a symbol representing the class of all sets with a given cardinality. Of course, to be totally sure that this definition works we have to prove that all the things we expect to be true about natural numbers are true when they are defined in this way. Now, in the 19th century, Giuseppe Peano gave an axiomatisation of the natural numbers: a set of self-evidently true statements (called axioms) which could be used to prove all true statements about the natural numbers¹. These axioms are as follows.

1. For every natural number $n$, $n + 1$ is not 0.
2. For every pair of natural numbers $m$ and $n$ such that $m + 1 = n + 1$, $m = n$.
3. For every set $S$, if $S$ contains 0, and $S$ has the property that for every member $n$ of $S$, $n + 1$ is a member of $S$, then for every natural number $n$, $S$ contains $n$.

In order to check that the axioms are satisfied we’ll have to define 0 and 1 and explain how natural numbers can be added using this definition. Clearly, 0 should represent the class of all sets with no members and 1 should represent the class of all sets with exactly one member. Now, suppose $m$ and $n$ are natural numbers and let $A$ and $B$ be sets from the respective classes which they represent. $A$ and $B$ can be chosen so that they have no members in common, and since $A$ has $m$ members and $B$ has $n$ members, the union of these two sets, $A \cup B$—the set which contains the members of both $A$ and $B$—has $m + n$ members. So we define $m + n$ as the class of all sets equal in cardinality to $A \cup B$.
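As a sanity check, here’s this definition of addition played out in Python (the element names are arbitrary placeholders of mine):

```python
# Cardinal addition as defined above: choose disjoint sets A and B
# with m and n members; then m + n is the cardinality of their union.
A = {"x1", "x2", "x3"}  # cardinality m = 3
B = {"y1", "y2"}        # cardinality n = 2, sharing no members with A

assert A.isdisjoint(B)  # the definition requires disjointness

print(len(A | B))  # 5, i.e. 3 + 2
```

If $A$ and $B$ overlapped, `len(A | B)` would come out smaller than $m + n$, which is exactly why the definition insists on choosing sets with no members in common.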

Now you have the tools you need to prove that the axioms are satisfied. I’m going to end this post here, although perhaps I’ll post the proofs some other time.

### Footnotes

1. However, this is not actually true: there are true statements about the natural numbers which cannot be proven by Peano’s axioms. One example is Goodstein’s theorem. Goodstein’s theorem is known to be true because it can be proven if you develop the theory of the ordinal numbers, a larger class of numbers which contains the natural numbers. There are, however, still true statements about the ordinal numbers which cannot be proven within that theory. In fact, any consistent theory stronger than that given by Peano’s axioms is incomplete in this way—this is the famous incompleteness theorem of Kurt Gödel.

## The axioms of set theory, part 2

The problem with the axiom schema of comprehension is very simple. By the axiom schema of comprehension there must be a set, which we’ll call the Russell set, which consists of the sets which do not contain themselves. Consider the question of whether the Russell set contains itself. Well, if the Russell set contained itself, that would mean it was a set which did not contain itself, since all the members of the Russell set do not contain themselves. Since this is contradictory, the Russell set must not contain itself. But the Russell set contains every set which does not contain itself, so it must contain the Russell set. No matter how you answer the question, a contradiction is inevitable.

[Russell’s paradox was actually not the only paradox in naive set theory; there were also, for example, the Burali-Forti paradox and Cantor’s paradox. However, these involve the concepts of ordinals and cardinals, respectively, which would take a lot of explaining. Maybe I’ll do a post about them later.]

Like most paradoxes, Russell’s paradox could be worked around; but mathematicians came up with a few different ways to work around the paradox. In this post I’m talking about ZF, the most popular set theory, but there are others, like Quine’s New Foundations (NF), which are just as adequate for mathematics and work around Russell’s paradox in a completely different way. This is problematic if you think of axioms as self-evident truths, since ZF and NF contradict each other, so their axioms can’t all be self-evident. It’s probably for this reason that mathematicians have generally abandoned this definition.

I’ll list the axioms of ZF which replace the axiom schema of comprehension below.

• The axiom schema of specification is a limited version of comprehension, which asserts the existence of arbitrary sets provided they are subsets of an existing set. A notable consequence of this is that no superset of the Russell set can exist, including the set of all sets.
• The axiom of pairing says that for every pair of sets, there is a set containing both of those sets.
• The axiom of union says that for every set $A$, there is a set, called the union of $A$, whose members are all the members of members of $A$. In other words, the union of $A$ is what you get when you merge all the sets that $A$ contains into a single set.
• The axiom of power set says that for every set $A$, there is a set, called the power set of $A$, whose members are all the subsets of $A$.
• The axiom schema of replacement says that for every function whose domain is a set, the range is a set as well. This is an axiom schema, because a function is essentially a logical formula with the property that for every member $x$ of the domain, there is exactly one object satisfying the formula when $x$ is taken as its argument (this object is the output of the function for the input $x$).
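The union and power set axioms are easy to mimic with Python’s frozensets, if you want something concrete to play with (the `power_set` helper is my own sketch, not a library function):

```python
from itertools import combinations

# The axiom of union: merge the members of A's members into one set.
A = frozenset({frozenset({1, 2}), frozenset({2, 3}), frozenset({4})})
union_of_A = frozenset().union(*A)
print(sorted(union_of_A))  # [1, 2, 3, 4]

# The axiom of power set: the set of all subsets of a set.
def power_set(s):
    return frozenset(
        frozenset(c) for r in range(len(s) + 1) for c in combinations(s, r)
    )

print(len(power_set(frozenset({1, 2, 3}))))  # 8 subsets, i.e. 2 ** 3
```

Frozensets are used rather than ordinary sets because only immutable objects can be members of other sets in Python, which conveniently mirrors the fact that in ZF the members of sets are themselves sets.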

All these axioms are very plausible, though one might wonder how the existence of these sets in particular is any more self-evident than the existence of, for example, the set of all sets. The last two axioms of ZF are somewhat more controversial. Incidentally, neither of these is implied by the axiom schema of comprehension; they would be necessary even in naive set theory, if you accepted their truth.

First, there is the axiom of infinity, which essentially states that a set exists of infinite size. You might wonder how this can be phrased in logical language. Well, first you say that the set contains some particular set, such as the empty set, as a kind of starting point. Then you think of an operation that can be performed on a set, so that if you applied it repeatedly to the starting point set, you would get an infinite sequence of distinct sets. Such an operation is called a successor operation. Then the axiom can say that a set exists which contains the starting point set and has the property that if you apply the successor operation to any one of its members, the resulting set is still a member. The operation usually used is to construct a set consisting of all the members of the original set, plus the original set itself. The reason for this is that if we call the empty set 0, the successor of 0 consists of just 0. We can call this set 1. The successor of 1 consists of 0 and 1; we can call this set 2. The successor of 2 consists of 0, 1 and 2; we can call this set 3. We can go on like this and define all the natural numbers as sets, with every natural number being the set of all the preceding natural numbers.
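This successor construction is simple enough to act out directly with Python’s frozensets (a toy model of mine, just to make the definition concrete):

```python
# The von Neumann construction sketched above: 0 is the empty set,
# and the successor of n is n together with {n} as a new member.
def successor(s):
    return s | frozenset({s})

zero = frozenset()      # 0 = {}
one = successor(zero)   # 1 = {0}
two = successor(one)    # 2 = {0, 1}
three = successor(two)  # 3 = {0, 1, 2}

# Each natural number n, built this way, has exactly n members...
print(len(three))  # 3
# ...and every natural number is the set of all the preceding ones.
print(zero in three, one in three, two in three)  # True True True
```

A pleasant side effect of this construction, visible in the final line, is that “$m$ is less than $n$” and “$m$ is a member of $n$” become the same statement.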

The reason the axiom of infinity is controversial is because infinity is controversial. There are mathematicians, known as finitists, who essentially deny the existence of any infinite quantities, or at least (since it’s not clear what it means for a mathematical concept to exist or not exist) believe that infinity is not a useful mathematical concept.

[There are even mathematicians, called ultrafinitists, who deny the existence of all the natural numbers… according to them, the existence of sufficiently large natural numbers must be considered, at least, unproven. This isn’t as stupid as it might sound at first.]

The final axiom is the axiom of choice. Strictly speaking, this is not part of ZF. It’s controversial enough that if you include the axiom of choice in the theory, you have to refer to it as ZFC (Zermelo-Fraenkel set theory with choice). This one deserves a whole page about it, so it’ll be the subject of part 3!

## The axioms of set theory, part 1

An axiom is traditionally defined as a self-evident truth. However, a more accurate modern description is that an axiom is a statement that is accepted as true, without proof, within some logical system, and used as a starting point to allow other statements to be proved within the system. The most famous examples of axioms are those of Euclid, but the more relevant ones to modern mathematicians are the Zermelo-Fraenkel (ZF) axioms of set theory. The ZF axioms are powerful enough to prove the vast majority of theorems that mathematicians are interested in.

The precise concept of a set is captured by the first ZF axiom, which is called the axiom of extensionality. It says that for all intents and purposes, two sets are exactly the same object if and only if each of them contains every object in the other one. In formal logical language, given two sets $A$ and $B$,

$A = B \Leftrightarrow \forall x \, (x \in A \Leftrightarrow x \in B).$

The ancient Greeks would probably have been satisfied with this axiom; it’s self-evident in that it’s essentially a statement of what the word “set” means.

Early set theorists completed their set theory by adding an axiom schema, called the axiom schema of comprehension. Before I explain what an axiom schema is, I’ll tell you what this axiom says: it says that for every conceivable logical formula, there is a set whose members are precisely the objects that satisfy the formula. Basically, for every collection of objects you might want to work with, there’s a set containing just those objects. The reason why this isn’t a true axiom is that an axiom is supposed to be a logical formula itself, and you can’t talk about logical formulae within logical formulae! Well, actually, there are systems of logic where you can do this, but it gets really complicated and it would certainly be better if we didn’t have to resort to such advanced systems of logic. Thankfully, there’s an easy way to get around the problem. We just say that for every logical formula, there’s an individual axiom which says that there’s a set whose members are the objects that satisfy the formula. This means our theory has an infinite number of axioms, which Euclid probably wouldn’t be too happy with. This doesn’t faze modern mathematicians too much though; I guess they leave it up to the philosophers of mathematics to argue over whether infinity is real.

[There is, actually, a way to finitely axiomatise NBG set theory, which is essentially the same as ZF but slightly stronger. Look it up if you’re interested.]

Note that we haven’t actually included any other axioms stating that objects exist. If we only have these two axioms, the only objects in our logical universe are, in fact, sets created by the axiom schema of comprehension. You might wonder, then, how there can be any objects at all, since sets have to contain objects themselves, don’t they? The axiom schema of comprehension doesn’t prove the existence of any sets which contain themselves, because there isn’t a logical formula that talks about such a set unless it actually exists. The possibility that a set containing itself could exist isn’t ruled out; we just can’t prove that such sets exist. However, the axiom schema of comprehension does prove the existence of a set containing no members, which is called the empty set and denoted $\emptyset$. To see this, just let the logical formula be any one which involves a contradiction and is therefore false when applied to every object. Once the empty set’s existence is proved, we can prove the existence of an infinite number of more complex sets, such as the set which contains only the empty set, the set which contains both of the afore-mentioned sets, and so on.

It would be very nice if all set theory required was these two axioms. Unfortunately, Bertrand Russell came along in 1901 and ruined everyone’s fun by showing that the axiom of comprehension proves a contradiction. More on that tomorrow.