Systems get an English lesson

Twas brillig and the slithy toves did gyre and gimble in the wabe.

Anyone who has struggled with Lewis Carroll's Jabberwocky knows that a word can sound perfectly fine yet be found in no dictionary of English. A computer spell-checker underlines most of the words in the poem, yet they slide pleasingly past the ear and the eye. They may be incorrect, but they don't feel "wrong".

Matters are much easier if we encounter something like tnlirrig, which we immediately identify as not being a word of English. We cannot even pronounce it. The difference between brillig and tnlirrig is that the first conforms to the constraints placed on the combination of sounds in the English language and the second does not. Such constraints are called phonological constraints, or phonotactics. Word-forms that meet the constraints but are not found in the mental lexicon of a native speaker of English are regarded as idiosyncratic, accidental gaps which may indeed become new words of the language in the future. Such potential words can exhibit regularity and also conform to the morphological (word-structure) constraints of the language; an English speaker might identify toves as the plural form of some new word tove.
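
To make the idea concrete, here is a minimal sketch of a phonotactic check. The inventory of permissible word-initial consonant clusters below is a deliberately tiny, invented sample, not a real description of English; a full system would encode the complete phonotactics of the language, not just the onset.

```python
# A minimal sketch of a phonotactic check. LEGAL_ONSETS is a toy,
# deliberately incomplete set of English word-initial consonant
# clusters, invented for this example.

VOWELS = set("aeiou")

LEGAL_ONSETS = {
    "", "b", "bl", "br", "d", "dr", "fl", "fr", "g", "gl", "gr",
    "s", "sl", "sn", "sp", "spr", "st", "str", "t", "tr", "w",
}

def initial_onset(word):
    """Return the consonant cluster before the first vowel."""
    for i, ch in enumerate(word):
        if ch in VOWELS:
            return word[:i]
    return word  # no vowel at all

def is_phonotactically_legal(word):
    return initial_onset(word.lower()) in LEGAL_ONSETS

print(is_phonotactically_legal("brillig"))   # True: 'br' is a legal onset
print(is_phonotactically_legal("tnlirrig"))  # False: 'tnl' is not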

One of the major problems in speech technology is the treatment of new words. In the context of speech recognition, however, "new word" usually means that it is new in relation to a particular body of data or corpus. The reason for this is that most modern speech recognisers are trained on a particular corpus which then forms the basis of a statistical language model; a word which is not in this model is termed a new word. It is now generally accepted that morphological and phonological constraints are required in speech technology applications. Statistical approaches to speech recognition are at the forefront of current research and form the basis for most commercial applications, but the potential for phonological theory to improve the performance of speech recognition has not been fully exploited.
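
The out-of-vocabulary problem can be illustrated with a toy example. The corpus below is invented, and the "language model" is reduced to bare word frequencies; the point is only that a word absent from the training data has no statistics at all, however well-formed it may be.

```python
from collections import Counter

# Toy illustration, not any particular recogniser: the "language model"
# here is just word frequencies from a tiny invented corpus, and any
# word absent from that corpus counts as a new (out-of-vocabulary) word.

corpus = "the cat sat on the mat and the dog sat on the mat".split()
counts = Counter(corpus)
total = sum(counts.values())

def probability(word):
    """Relative frequency of `word` in the training corpus."""
    return counts[word] / total

for word in ["cat", "toves"]:
    if word in counts:
        print(f"{word}: seen, P = {probability(word):.3f}")
    else:
        print(f"{word}: new word, not in the training corpus")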

A speech utterance is not just a sequence of words with pauses between them; rather, it can be likened to a string of characters of a text without any gaps. A native speaker of English might eventually recognise the string "themoreitsnowstiddelypom" as a line from A.A. Milne's The House at Pooh Corner, but there are a number of possible segmentations into individual words: "it's now", "it snow" and so on. Our implicit knowledge of the phonotactics of English allows us to exclude a form such as tsnow. The fact that we usually have only the speech, and not an accompanying text, makes the recognition process all the more difficult.
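
One way to picture the segmentation problem is a recursive search over a lexicon. The tiny lexicon below is invented for the example, and the fake entry stiddely is included purely to expose the ambiguity; in a fuller system a phonotactic filter would also prune impossible fragments such as tsnow.

```python
# A sketch of lexicon-driven segmentation: enumerate every way of
# cutting an unbroken string into known words. Toy lexicon; 'stiddely'
# is a fake entry added only to create an ambiguity.

LEXICON = {"the", "more", "it", "its", "snow", "snows", "now",
           "tiddely", "stiddely", "pom"}

def segmentations(text):
    """Return all splits of `text` into words drawn from LEXICON."""
    if not text:
        return [[]]
    results = []
    for i in range(1, len(text) + 1):
        prefix = text[:i]
        if prefix in LEXICON:
            for rest in segmentations(text[i:]):
                results.append([prefix] + rest)
    return results

for words in segmentations("themoreitsnowstiddelypom"):
    print(" ".join(words))
# the more it snow stiddely pom
# the more it snows tiddely pom
# the more its now stiddely pom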

Phonotactic constraints are often not sufficient to resolve all ambiguities and other knowledge sources such as prosody ('record or re'cord) and syntactic context (noun or verb) must also be taken into account. Even then, some ambiguity remains. The old Two Ronnies TV sketch in the hardware store relies on the ambiguity of speech and dialect variation when they confuse "four candles" with "fork handles".
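
A sketch of how prosody could serve as such an extra knowledge source: if the recogniser can tell which syllable carries the stress, the noun and verb readings of record can be separated. The little lookup table below is invented for illustration, not a real pronunciation lexicon.

```python
# Toy disambiguation by stress pattern: the same letter string maps to
# different word categories depending on which syllable is stressed.
# Entries are hypothetical illustrations.

STRESS_LEXICON = {
    ("record", "initial"): "noun ('record, as in a vinyl record)",
    ("record", "final"):   "verb (re'cord, as in to record a song)",
}

def disambiguate(word, stressed_syllable):
    """Pick a reading from the stress pattern, if one is listed."""
    return STRESS_LEXICON.get((word, stressed_syllable), "unknown")

print(disambiguate("record", "initial"))  # noun reading
print(disambiguate("record", "final"))    # verb reading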

Another hurdle for recognition is the variability of speech. An isolated utterance of b may sound much the same each time it is repeated, but the realisation of b in boot is quite different from that of b in bat. In the first case the lips are rounded while the b is spoken, anticipating the o; in the second they are spread, anticipating the a.

This leads to an overlap of properties (co-articulation) and thus to different acoustic representations both of b and the neighbouring sounds in each case. This, together with the fact that variability can also be found among speakers, means that a rigid segmentation of the speech signal into strictly non-overlapping units cannot account for the full variability of speech.
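
The alternative hinted at here is to represent the signal as a set of overlapping property intervals rather than a chain of segments. The sketch below, with invented time values, shows how overlap and precedence between properties can then be queried directly.

```python
from dataclasses import dataclass

# A property (e.g. lip rounding, voicing) occupies an interval of time;
# intervals may overlap instead of being chopped into rigid segments.
# All time values below are invented for illustration.

@dataclass
class Property:
    name: str
    start: float  # seconds (hypothetical)
    end: float

# In "boot", lip rounding for the vowel begins during the b closure.
boot = [
    Property("labial closure (b)", 0.00, 0.05),
    Property("lip rounding (oo)", 0.02, 0.20),  # overlaps the closure
    Property("voicing", 0.00, 0.25),
]

def overlaps(a, b):
    """True if two property intervals share any stretch of time."""
    return a.start < b.end and b.start < a.end

def precedes(a, b):
    """True if property a ends before property b begins."""
    return a.end <= b.start

print(overlaps(boot[0], boot[1]))   # True: co-articulation of b and oo
print(precedes(boot[0], boot[1]))   # False: they are not strictly ordered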

A computational linguistic approach to speech recognition is being pursued by a research team in the Computer Science Department at UCD. This was the subject of a presentation at the recent Royal Society meeting Computers, language and speech, and it offers solutions to the problems of how to process words which have not been heard before and how to develop fine-grained knowledge representation and processing techniques for linguistic units smaller than the word. The main aim of the work has not been to build a speech recognition system which can compete with statistical systems in terms of raw performance. Rather, the aim is to design a knowledge-based component for a speech recognition system, to help recognise new words, to model and investigate co-articulation effects, and to deal with uncertain structures.

This knowledge-based approach uses a complete description of the phonotactic constraints of the language to distinguish between actual forms (those found in the lexicon) and potential (i.e. new) ones. In contrast to statistical approaches to speech recognition, the speech signal is interpreted in terms of overlap and precedence between properties. This avoids a rigid segmentation into non-overlapping units and allows the variability of speech to be modelled. It could mean that the frumious Bandersnatch will be treated no differently from the indigenous boarfish in the speech recognition systems of the future. More accurate and useful speech recognition should result.
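
Putting the pieces together, the actual/potential distinction might look like this in miniature. The lexicon and the onset test are toy stand-ins for the full phonotactic description the approach relies on, and this check inspects only the initial consonant cluster.

```python
# A miniature sketch of the actual/potential distinction. LEXICON and
# LEGAL_ONSETS are invented toy samples, not real linguistic resources.

LEXICON = {"boarfish", "boot", "bat", "snow"}
LEGAL_ONSETS = {"", "b", "br", "fr", "sl", "sn", "t"}
VOWELS = set("aeiou")

def onset(word):
    """Return the consonant cluster before the first vowel."""
    for i, ch in enumerate(word):
        if ch in VOWELS:
            return word[:i]
    return word

def classify(word):
    if word in LEXICON:
        return "actual word"
    if onset(word) in LEGAL_ONSETS:
        return "potential word: well-formed but not in the lexicon"
    return "not a possible English word"

for w in ["boarfish", "bandersnatch", "tnlirrig"]:
    print(w, "->", classify(w))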

Dr Julie Berndsen lectures in computer science at UCD. Further information on her work is available at www.cs.ucd.ie/staff/berndsen/