The Enumerability of Strings

In my last post I mentioned that the set of finite strings over a countable (possibly infinite) alphabet is enumerable. I want to go over the details of that because I think it’s neat but I don’t think I made a compelling case for it in that post.

We’ll need a few definitions to get us started:

An alphabet $\Sigma$ is simply a set of symbols. For example, our English alphabet uses the symbols $\{a,b,c,\ldots,x,y,z\}$ and the Greek alphabet uses the symbols $\{\alpha,\beta,\gamma,\ldots,\chi,\psi,\omega\}$. Both of these examples are finite alphabets, but we can also imagine a countably infinite alphabet. For generic purposes, we’ll consider the symbols of this alphabet to be $\{\sigma_0,\sigma_1,\sigma_2,\sigma_3,\ldots\}$. For the purposes of our proofs, all the alphabets we discuss below will be countably infinite.

A string over an alphabet $\Sigma$ is a sequence of symbols in $\Sigma$, usually indicated between quotes. For example, an English string might be “apple” or “xoarotl”. A generic string would look something like “$\sigma_3\sigma_{99}\sigma_0\sigma_0\sigma_{806}$“. We also impose a restrictions that strings must be finite in length.

And now the meat of what we want to show:

Lemma: The set of strings of length-$n>0$ is enumerable.

Proof of Lemma: This can be proved by induction. The set of length-1 strings is trivially enumerable by the function $\lambda_1$ where $\lambda_1(i)=\sigma_i$. Then, given a function $\lambda_k$ that enumerates the length-$k$ strings, we can enumerate the length-$k+1$ strings by creating the following table, where each $\lambda_k(i)$ runs across the x-axis, and the symbols of our alphabet run down the y-axis. In each spot in the table, we append the symbol from the y-axis to the string from the x-axis to obtain the listed result (let $\&$ be the concatenation operator).

 $\lambda_k(0)\&\sigma_0$ $\rightarrow$ $\lambda_k(1)\&\sigma_0$ $\lambda_k(1)\&\sigma_0$ $\rightarrow$ $\lambda_k(1)\&\sigma_0$ $\ldots$ $\swarrow$ $\nearrow$ $\swarrow$ $\lambda_k(0)\&\sigma_1$ $\lambda_k(1)\&s_1$ $\lambda_k(2)\&\sigma_1$ $\lambda_k(3)\&\sigma_1$ $\ldots$ $\downarrow$ $\nearrow$ $\swarrow$ $\lambda_k(0)\&\sigma_2$ $\lambda_k(1)\&\sigma_2$ $\lambda_k(2)\&\sigma_2$ $\lambda_k(3)\&\sigma_2$ $\ldots$ $\swarrow$ $\lambda_k(0)\&\sigma_3$ $\lambda_k(1)\&\sigma_3$ $\lambda_k(2)\&\sigma_3$ $\lambda_k(3)\&\sigma_3$ $\ldots$ $\downarrow$ $\vdots$ $\vdots$ $\vdots$ $\vdots$ $\ddots$

We can then generate $\lambda_{k+1}$ by moving along the diagonal of this table in the order indicated by the arrows. With some reflection, you should be able to see that this will enumerate all of the length-$k+1$ strings. Thus, by induction, for a given $n>0$, we can enumerate all of the length-$n$ stringsover our alphabet. The function that enumerates them will be called $\lambda_n$.

$\Box$

Theorem: The set of strings over a finite alphabet $\Sigma$ is enumerable.

Proof: This proof is almost trivial after the last one, but it makes use of the same diagonalization algorithm as well as all of the $\lambda_i$ functions. Consider the following table:

 $\lambda_1(0)$ $\rightarrow$ $\lambda_1(1)$ $\lambda_1(2)$ $\rightarrow$ $\lambda_1(3)$ $\ldots$ $\swarrow$ $\nearrow$ $\swarrow$ $\lambda_2(0)$ $\lambda_2(1)$ $\lambda_2(2)$ $\lambda_2(3)$ $\ldots$ $\downarrow$ $\nearrow$ $\swarrow$ $\lambda_3(0)$ $\lambda_3(1)$ $\lambda_3(2)$ $\lambda_3(3)$ $\ldots$ $\swarrow$ $\lambda_4(0)$ $\lambda_4(1)$ $\lambda_4(2)$ $\lambda_4(3)$ $\ldots$ $\downarrow$ $\vdots$ $\vdots$ $\vdots$ $\vdots$ $\ddots$

Now, again by travelling along the diagonal (as indicated by the arrows) we can see that all of the $\Sigma$-strings will be enumerated, since each $i^{th}$ row will contain all of the $i$-length strings. Thus we can enumerate all the strings over a countably infinite alphabet.

$\Box$

This isn’t quite the enumeration I alluded to in my last post, which instead of using the diagonalization trick involves an explicit mathematical formula, however this version is much easier to illustrate and does the trick equally well.

This theorem has a couple of interesting applications. As we saw in the previous post, we can consider a proof to be a “string” over an “alphabet” where the symbols are individual sentences. Thus, by this theorem, the set of proofs in a language is enumerable (assuming we have an effective way to decide which strings observe the rules of our proof system). We can get a couple of other interesting interpretations, too:

Corollary: (i) The set of all tuples of natural numbers, (ii) the set of finite subsets of $\mathbb N$, (iii) the set of images, and (iv) the set of all computer programs are all enumerable.

Proof of (i): A tuple of natural numbers can be thought of as a string over the alphabet $\mathbb N$. By our theorem, the set of all tuples of natural numbers is enumerable.

Proof of (ii): The set of subsets of $\mathbb N$ can be thought of as unordered tuples. So long as we allow repetitions in our enumerations, this set is enumerable.

Proof of (iii): An image can be thought of as a string over an alphabet of pixel colours. When we do the diagonalization trick from above, instead of looking at the length of a string, we need to consider two different dimensions: the width and height of the image. This can be done with a simple inverse pairing function applied to the original string length. Thus, the set of all images is enumerable.

Proof of (iv): Similar to a proof, a computer program can be thought of as a linear sequence of instructions. Thus, a computer program is a “string” over an “alphabet” of possible instructions and variables (such as loop controls, conditional logic, arithmetic operations, etc…). Thus the set of all computer programs (in a given language) is enumerable.

$\Box$

(On a personal note, I totally finished writing this just as the podcast I was listening to ended. Great timing or what?)