Turing Machines and Natural Language

Author: Peter Bradley

The Turing machines we have studied are all well and good if you're only interested in doing simple arithmetic. But what if you want to understand language?

Take a grammatically simple rule like the formation of questions in English. If I say:

The rose will be in the vase.

It is a simple matter to form a question from that statement:

Will the rose be in the vase?

The transformation rule applied here could certainly be encoded in a machine table: 'Move to the right until you find a verb. Print out that verb, and move to the left until you hit the beginning of the sentence. Then move to the right, printing everything you find except the verb.'

The machine starts in state 1, in which each instruction says 'move to the right and stay in state 1', except the instruction for a verb. That instruction says 'print the verb, move to the left, and go into state 2'. State 2, like state 1, says, for everything other than punctuation (which includes 'X'), move to the left. The punctuation at the beginning of the sentence sets the machine into state 3, which says 'move to the right, printing anything other than a verb, until you hit a punctuation mark, then stop'. Take some time to experiment with different sentences made up of the words in the 'database' displayed here. The following sentences should be transformed correctly (a rough code sketch of the table follows the examples):

Joe is Bob. => is Joe Bob?
The rose will be in the vase. => will the rose be in the vase?
Joe will go with Bob. => will Joe go with Bob?
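
The gist of that table can be sketched in code. The following Python sketch is a toy rendering of my own, assuming a word-level tape bounded by '#' markers and a hand-made verb list; it is not the applet's exact table:

    # Toy sketch of the three-state question-forming machine.
    # Assumptions: '#' marks both sentence boundaries, and the verb
    # list below is a stand-in for the applet's word database.
    VERBS = {"is", "will", "can"}

    def form_question(sentence):
        tape = ["#"] + sentence.rstrip(".").split() + ["#"]
        out, pos, state, verb_pos = [], 1, 1, None
        while state != 0:
            if state == 1:                    # scan right for the first verb
                if tape[pos] in VERBS:
                    out.append(tape[pos])     # 'print' the verb
                    verb_pos, state = pos, 2
                else:
                    pos += 1
            elif state == 2:                  # rewind to the left boundary
                if tape[pos] == "#":
                    state = 3
                else:
                    pos -= 1
            else:                             # state 3: copy all but the verb
                pos += 1
                if tape[pos] == "#":
                    state = 0                 # right boundary: halt
                elif pos != verb_pos:
                    out.append(tape[pos])
        return " ".join(out) + "?"

    print(form_question("Joe is Bob."))       # is Joe Bob?

(The sketch assumes the sentence contains a verb, and leaves capitalization as it finds it.)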

You'll notice a few differences between this Turing machine and the ones we've already covered - for starters, because we wanted to parse English sentences, we had to create a Turing machine that reads and writes words instead of '1's and '0's. But that raises an immediate difficulty. All the Turing machines we have studied so far are defined by their machine tables, and a machine table's columns correspond to the possible marks the machine can find on the tape. Since the number of possible words is infinite, a true natural language processor would have to have an infinite machine table, and that is impossible.

The first step in creating a natural language parser is to come up with a way to categorize the infinite set of possible inputs (words) into a finite number of groups. Fortunately, there is a clear choice: parts of speech. We (human) speakers of English understand a sentence not just because we understand the meaning of the words that make it up (the sentence's 'semantics'), but also because of the way in which those words are organized (the sentence's 'syntax'). Take an example: if I say 'Silence is foo', as Daffy Duck once did, you may not know what 'foo' means (that is, you may not know the semantics of the word 'foo'), but you do know, because of the verb 'is', that I am asserting that silence has the property of being foo. The syntax of the word 'foo', that is, the role it plays in the sentence, gives us a clue about its meaning.

To see this, add a word to the known word database, and use that word in an input sentence. As you might expect, the machine 'knows' nothing about the semantics of the word: it simply matches the symbols on its tape to the symbols in the known word database. Try adding nonsense words like 'xyzzy' and 'foobar'; the machine accepts these like any other.

The machine that we're creating knows nothing of semantics, but that's not really a problem. The process by which we turn declarative sentences into interrogative sentences is a purely syntactic one: I don't need to know what 'foo' means to say 'Is silence foo?' in response to Daffy's assertion that 'silence is foo'. Therefore, in order to parse natural language with a machine table, we need to write a table with columns corresponding to the syntactic structure of the sentence, not its semantics. Again, the different parts of speech suffice.

There is one further complication. A Turing machine can only match symbols to other symbols. In order to determine which symbols (words) belong to which part of speech, we need to create a database that maps common words to the parts of speech represented in the machine table.

Take another look at the machine above. Its machine table has five columns, rather than two, and its tape contains words rather than '1's and '0's. When this machine reads the tape, it doesn't simply match the value on the tape to the value in the header row of the machine table. Instead, the machine checks the word read off the tape against its database of known words to find what part of speech that word is. Then, and only then, does it check the header row to find its instructions.
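
In code, that extra lookup step might look something like this sketch; the lexicon and table here are toy stand-ins of my own, not the applet's actual database:

    # The head reads a word, asks the 'database' for its part of
    # speech, and only then consults the machine table, whose
    # columns are parts of speech rather than individual words.
    LEXICON = {"joe": "noun", "bob": "noun", "rose": "noun", "vase": "noun",
               "the": "article", "in": "preposition",
               "is": "verb", "will": "verb", ".": "punctuation"}

    TABLE = {  # state -> {part of speech -> (action, next state)}
        1: {"noun": ("right", 1), "article": ("right", 1),
            "preposition": ("right", 1), "verb": ("print", 2)},
    }

    def step(state, word):
        part = LEXICON.get(word.lower())     # database lookup first...
        return TABLE[state][part]            # ...then the table column

    print(step(1, "rose"))                   # ('right', 1)
    print(step(1, "is"))                     # ('print', 2)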

When you were playing with the machine above, did you find any sentences for which it fails to produce the correct interrogative output? What about:

Bob wept.

The correct corresponding question is:

Did Bob weep?

Not, as the machine produces:

Wept Bob?
This is incorrect. To fix this problem, we'll have to create a new machine table column called 'intransitive verbs' with the instruction 'instead of printing the value on the tape, print "did", and then, when you return to this position, print the base (infinitive) form of the verb'. In order to add this rule, however, we'd need to add one additional feature to our Turing machine: a place to store the information about what to do 'upon return', and that requires something that our machine lacks: memory.
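
Here, roughly, is what that rule would do, sketched with the 'upon return' memory made explicit; the little past-to-base table is my own toy stand-in for real verb morphology:

    # Toy sketch of the 'did'-insertion rule with one memory slot.
    AUX = {"is", "will", "can"}                 # verbs to front as-is
    BASE_FORM = {"wept": "weep", "went": "go"}  # toy morphology

    def form_question(sentence):
        words = sentence.rstrip(".").split()
        for i, w in enumerate(words):
            if w in AUX:                        # auxiliary: front it as before
                return " ".join([w] + words[:i] + words[i+1:]) + "?"
            if w in BASE_FORM:                  # bare verb: front 'did'...
                memory = BASE_FORM[w]           # ...and remember the base form
                return " ".join(["did"] + words[:i] + [memory] + words[i+1:]) + "?"
        return sentence                         # no verb found: give up

    print(form_question("Bob wept."))           # did Bob weep?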

Now consider an even more difficult case:

The rose that is red is in the vase.

Try it in the machine above. The output should be:

Is the rose that red is in the vase?
A clear mistake. And, interestingly, one that native English speakers never make, even as children! Why?

Well, it's because 'that is red' is functioning as a noun-phrase: it is read as attached to the noun 'rose'. When we form the appropriate question from 'The rose that is red is in the vase', we skip over the first 'is', as it is part of the noun-phrase, and concentrate on the second, forming:

Is the rose that is red in the vase?
How could we write a machine table to handle this? Well, I don't believe we can. If we write a table that moves only the second verb encountered, we would, for 'Joe will go with Bob', get:

Go Joe will with Bob?

instead of

Will Joe go with Bob?
What the machine needs is an implicit (or, as Chomsky might say, 'innate') understanding of the structure of the grammar. It needs to know that sentences have a subject-predicate structure, and that a noun-phrase encountered immediately following the subject is part of the subject, not the predicate. Here is one such 'higher-level' language processor:

A Simple Natural Language Parser

You'll notice that while this parser does fairly well, it still can't handle sentences with relative clauses like 'The rose that is in the vase is on the table.' For that, we need to do a little something more:

A More Complex Natural Language Parser

There are a couple of hacks that need to be pointed out. First, to solve the first problem noted above, if the verb found is not a form of 'to be' and does not have a modal auxiliary in the sentence, it is given an implicit 'does' as its modal auxiliary. This doesn't solve the problem completely (it produces 'Does Bob wept' from 'Bob wept' instead of 'Does Bob weep'), but we have not yet programmed verb morphology into the system. Second, to solve the second problem noted above, when the machine head encounters a relative pronoun like 'that', the machine assumes it marks the beginning of a clause, marking the tape with an 'X' and assigning all tape blocks a value of 'null' until it reaches a verb or punctuation (as specified in the table). This allows a special instruction to operate (although it does not appear in the table) that says: 'if you hit an "X" or a block whose value is "null", keep doing what you were doing before'. Marking the tape in this way effectively removes the phrase from the processed sentence, thus blocking its interference with the main verb.
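
As a rough illustration of that masking hack (again with toy word lists of my own, not the applet's table):

    # Everything from a relative pronoun up to and including the next
    # verb is marked 'null', so the first verb the scanner then sees
    # is the main verb of the sentence.
    RELATIVE = {"that", "which", "who"}
    VERBS = {"is", "will", "wept"}

    def mask_relative_clause(words):
        masked = list(words)
        i = 0
        while i < len(masked):
            if masked[i] in RELATIVE:
                masked[i] = None                   # the 'X' mark
                i += 1
                while i < len(masked) and words[i] not in VERBS:
                    masked[i] = None               # null the clause...
                    i += 1
                if i < len(masked):
                    masked[i] = None               # ...and its verb
            i += 1
        return masked

    def main_verb_index(words):
        for i, w in enumerate(mask_relative_clause(words)):
            if w in VERBS:                         # nulls are skipped
                return i

    words = "the rose that is red is in the vase".split()
    i = main_verb_index(words)
    print(" ".join([words[i]] + words[:i] + words[i+1:]) + "?")
    # is the rose that is red in the vase?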

This system is not perfect - far from it. On the other hand, where it fails is perhaps what is most interesting about this machine. Try constructing your own sentences out of the machine's known word database, and note where, if anywhere, it goes wrong. You should notice that the machine is incapable of dealing with adjectives or adverbs (there is no 'adj' or 'adv' column in the machine table). Given the words loaded into the database, this is not that big a problem, but consider the following sentence:

That sentence is readily parsed by the Turing machine. But what about the sentence:

This sentence is not. In order to parse the first sentence, the machine must assume that a noun following a noun (with no articles) is the object of the sentence, while the first noun is the subject. The name 'John Smith' is understood by native English speakers as a single noun, but that is not evident in the 'mere' surface grammar of the sentence. A fully-fledged natural language processor would have to have some mechanism for recognizing proper names as a single noun, rather than as a subject-object pair.
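
One possible mechanism, sketched here as an assumption of my own rather than anything in the machine above, is a greedy longest-match lookup, so that a multi-word name is consumed as a single noun:

    # Prefer the longest dictionary match at each position, so
    # 'John Smith' is one noun, not a subject-object pair (toy data).
    LEXICON = {("john", "smith"): "noun", ("joe",): "noun",
               ("bob",): "noun", ("loves",): "verb"}

    def tokenize(words):
        i, out = 0, []
        while i < len(words):
            for n in (2, 1):                       # longest match first
                key = tuple(w.lower() for w in words[i:i + n])
                if key in LEXICON:
                    out.append((" ".join(words[i:i + n]), LEXICON[key]))
                    i += n
                    break
            else:
                out.append((words[i], "unknown"))
                i += 1
        return out

    print(tokenize("John Smith loves Bob".split()))
    # [('John Smith', 'noun'), ('loves', 'verb'), ('Bob', 'noun')]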

Second, the machine matches words on its input against words in its database to determine the part of speech. But many English words (for example, 'rose') can be more than one part of speech: 'rose' can be a noun, as in 'The rose is in the vase', or a verb, as in 'The sun rose this morning'. If you add 'rose' to the transitive-verb list, the machine will fail to parse the sentence correctly. Consider this (rather famous) example:

Buffalo buffalo buffalo Buffalo buffalo.
It may be difficult to parse, but it is a sentence in English. In English we often take place names (such as 'Buffalo', meaning Buffalo, NY) and turn them into adjectives. Thus, if Joe is from Buffalo, we might call him 'Buffalo Joe'. But 'buffalo' is also a noun, referring to the animal. Thus, a buffalo from Buffalo can be called a 'Buffalo buffalo'. The final step is to notice that 'buffalo' is also a verb: it means 'to bluff aggressively'. Thus, the sentence 'Buffalo buffalo buffalo Buffalo buffalo' means:

Buffalo from Buffalo, NY bluff buffalo from Buffalo, NY.
'Buffalo buffalo buffalo Buffalo buffalo' is totally unparsable by this machine.
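A small sketch shows why: once a word carries several possible parts of speech, a machine that assigns exactly one tag per database lookup cannot even get started, and the number of candidate readings multiplies with every word (the tag sets below are my own toy assignments):

    from itertools import product

    TAGS = {"buffalo": {"adjective", "noun", "verb"},
            "rose": {"noun", "verb"}}

    def tag_sequences(words):
        # every way of choosing one part of speech per word
        options = [sorted(TAGS.get(w.lower(), {"unknown"})) for w in words]
        return list(product(*options))

    sentence = "Buffalo buffalo buffalo Buffalo buffalo".split()
    print(len(tag_sequences(sentence)))   # 3**5 = 243 candidate readings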

Third, natural language is recursive. Consider the sentence:

Silence is foo.
That sentence (if you add the words 'Silence' and 'foo' to the noun database) is readily parsed by this Turing machine. But what about the sentence:

This sentence is parsable, thanks to our 'noun-phrase hack' described above! But what about:

It's not. Why not? Because the machine can only parse sentences with one embedded phrase. It is not a recursive function, so it cannot handle sentences-within-sentences-within-sentences...

It cannot, therefore, handle:


All of which (and many more) are sentences in English, readily understood by native English speakers.

What is needed is a truly recursive function. Instead of jumping over a phrase that starts with 'that', it needs to start over at the beginning of the table and treat what comes next as a sub-sentence. This one does just that:
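
(The processor itself is interactive; as a rough Python illustration of the same recursive step, here is a sketch under a toy grammar and lexicon of my own choosing, not the machine's actual table:)

    # Recursive-descent sketch: on a complementizer like 'that',
    # start over and parse what follows as a sub-sentence.
    LEXICON = {"joe": "noun", "bob": "noun", "silence": "noun",
               "foo": "noun", "the": "art",
               "is": "verb", "said": "verb", "that": "comp"}

    def parse_np(words, i):
        np = []
        while i < len(words) and LEXICON.get(words[i]) in ("art", "noun"):
            np.append(words[i]); i += 1
        return " ".join(np), i

    def parse_sentence(words, i=0):
        subject, i = parse_np(words, i)
        verb = words[i]; i += 1                   # assumes a verb comes next
        if i < len(words) and LEXICON.get(words[i]) == "comp":
            obj, i = parse_sentence(words, i + 1)  # the recursive step
        elif i < len(words):
            obj, i = parse_np(words, i)
        else:
            obj = None
        return {"subj": subject, "verb": verb, "obj": obj}, i

    tree, _ = parse_sentence("bob said that joe said that silence is foo".split())
    print(tree)
    # {'subj': 'bob', 'verb': 'said', 'obj': {'subj': 'joe', 'verb': 'said',
    #  'obj': {'subj': 'silence', 'verb': 'is', 'obj': 'foo'}}}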

It will parse any number of sub-clauses, and even create the appropriate questions - but they may fall outside of the visible area of the Flash movie. There are a couple of cheats: (1) I hold the sentences in a memory array, which is visually demonstrated here by the hierarchical sentence diagrams; (2) the machine places nouns in the object or subject category depending on whether or not the verb is visible in the grammar diagram. This parser is still not fully functional with natural language, but it is harder to break than the others. And it is remarkable that a simple nine-line machine table, with an embedded 'innate' grammatical structure, can parse such a wide variety of English sentences. Here are a few to try:

Copyright: 2002