How it works
We're drowning in words. Countless millions of online documents vie for our attention, but nobody has the time to sift the wheat from the chaff. While today's search engines are well suited to casual browsers poking around for brioche recipes or osteoporosis information, users who need to digest information scattered across hundreds or thousands of documents are often left unsatisfied. The fundamental problem is that search engines return entire documents, rather than summaries.
To the rescue comes the burgeoning field of "information extraction" (IE), which investigates tools to extract documents' essential details automatically.
For example, a government foreign policy analyst might want to see newspaper articles concerning terrorist events. Ideally, he or she would give an IE system newspaper articles, perhaps retrieved by a search engine:
A 32-year-old Belgian fugitive suspected of masterminding a Madrid bus station massacre was arrested in Rome.
Lieven Pittois, who has been at large for five years, was convicted of involvement in the 1977 bombing attack that killed 15 passengers and injured 36.
The IE system would then produce the following summary (or "case frame", in IE jargon):
location: Madrid
year: 1977
perpetrator: Lieven Pittois
method: bomb
target: bus station
dead: 15
wounded: 36
This structured summary can be grasped at a glance, and can be easily incorporated into applications such as databases or spreadsheets. But if information extraction sounds like a tall order, you're right: unlike search engines, IE systems are still research prototypes.
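The structured form is what makes this possible: a case frame is little more than a record of named slots. As a rough sketch in Python (the slot names come from the example above; the export step is just one illustrative possibility), a case frame can be stored and shipped off to a spreadsheet in a few lines:

```python
import csv
import io

# A case frame is a set of named slots -- a natural fit for a dict.
# The slot names mirror the example summary above.
case_frame = {
    "location": "Madrid",
    "year": 1977,
    "perpetrator": "Lieven Pittois",
    "method": "bomb",
    "target": "bus station",
    "dead": 15,
    "wounded": 36,
}

# Once the data is structured, exporting it -- say, as a CSV row for a
# spreadsheet or database -- is trivial.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=case_frame.keys())
writer.writeheader()
writer.writerow(case_frame)
print(buffer.getvalue())
```

The hard part, of course, is filling in those slots from free text in the first place.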
While search engines are usually general-purpose, the key to successful information extraction is tailoring systems to specific audiences. Estate agents might want lists of property transactions in their area, while stock brokers need information about corporate mergers and takeovers.
Ideally, computers would just read the documents. Unfortunately, today's computers barely understand human language. How can the system determine that Madrid (not Rome) is the site of the attack? How can it know that "32-year-old Belgian fugitive" and "Lieven Pittois" are one and the same person?
IE systems rely on extensive linguistic analysis. For example, "co-reference resolution" (determining that Lieven is the fugitive - no one ever accused researchers of insufficient jargon!) involves reasoning about the entities to which phrases refer.
Since Lieven and the fugitive are both people, they might refer to the same entity (unlike 1977 or bus station).
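That type-compatibility test can be sketched as a simple filter. This is a toy illustration only: the semantic-type lexicon below is invented for the example, and real systems derive such types from large lexicons, parsers and discourse models rather than a hand-typed table.

```python
# Hypothetical semantic-type lexicon: each phrase is tagged with a
# coarse type. (Invented for illustration; real systems are far richer.)
PHRASE_TYPES = {
    "Lieven Pittois": "person",
    "32-year-old Belgian fugitive": "person",
    "1977": "year",
    "bus station": "place",
}

def may_corefer(phrase_a, phrase_b):
    """Two phrases can refer to the same entity only if their
    semantic types are known and compatible (here: identical)."""
    type_a = PHRASE_TYPES.get(phrase_a)
    type_b = PHRASE_TYPES.get(phrase_b)
    return type_a is not None and type_a == type_b

print(may_corefer("Lieven Pittois", "32-year-old Belgian fugitive"))  # True
print(may_corefer("Lieven Pittois", "bus station"))                   # False
```

In practice the type check only rules candidates out; deciding which compatible phrases actually co-refer takes much more reasoning.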
IE systems make extensive use of background knowledge. For example, they must be told that "injured" and "wounded" are synonyms, and that two or more capitalised words might be a person's name. Even painfully obvious facts (1977 is probably a year, while 15 and 36 aren't) must be explicitly stated.
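A few of those hand-coded facts can be sketched as simple rules. These are deliberately crude toy heuristics, written here to show the flavour of the knowledge involved; a real system encodes thousands of such facts, with many exceptions.

```python
import re

# Hand-coded synonym knowledge: "injured" and "wounded" fill the
# same slot, so both map to one canonical slot name.
SYNONYMS = {"injured": "wounded", "wounded": "wounded"}

def looks_like_person_name(phrase):
    """Crude rule: two or more capitalised words might be a name."""
    words = phrase.split()
    return len(words) >= 2 and all(w[0].isupper() for w in words)

def looks_like_year(token):
    """Crude rule: a four-digit number like 1977 is probably a year;
    small numbers like 15 or 36 are not."""
    return bool(re.fullmatch(r"(19|20)\d\d", token))

print(looks_like_person_name("Lieven Pittois"))        # True
print(looks_like_year("1977"), looks_like_year("15"))  # True False
```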
Where does this background knowledge come from? IE systems usually depend on carefully handcrafted knowledge - which partly explains why IE systems are still research prototypes rather than commercial products.
But many researchers are also investigating systems that automatically learn this background knowledge. The basic idea is to give the system a supply of example documents describing terrorist events, each accompanied by the desired IE output; the system then generalises from these examples to build extraction rules of its own.
To work out the effectiveness of a given IE system, the strategy is to give the same document to a person and the system, and compare their summaries. These are usually evaluated in terms of "recall" and "precision". Recall measures how many of the desired items the system's summary actually contains, as a fraction of the total (seven in the example above). If the system's summary is correct except that the year 1977 is inadvertently omitted, recall is 6/7, or 86 per cent.
Precision measures the fraction of the summary's items that are correct. Suppose that besides the missing year, the system listed Rome instead of Madrid as the attack's location. The summary now contains six items, only five of them correct, so precision is 5/6, or 83 per cent.
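Both scores are easy to compute slot by slot. Here is a sketch, reusing the case frame from the example; the exact-match scoring below is a simplification of how real IE evaluations credit partial answers.

```python
def recall_precision(reference, system):
    """Compare a system case frame against a human-made reference.
    Recall: correct slots found / slots in the reference.
    Precision: correct slots found / slots the system produced."""
    correct = sum(1 for slot, value in system.items()
                  if reference.get(slot) == value)
    return correct / len(reference), correct / len(system)

reference = {"location": "Madrid", "year": 1977,
             "perpetrator": "Lieven Pittois", "method": "bomb",
             "target": "bus station", "dead": 15, "wounded": 36}

# First scenario: the year is omitted but everything else is right.
missing_year = {k: v for k, v in reference.items() if k != "year"}
r1, p1 = recall_precision(reference, missing_year)   # recall 6/7

# Second scenario: the year is missing AND the location is wrong,
# so only five of the six produced slots are correct.
wrong_location = dict(missing_year, location="Rome")
r2, p2 = recall_precision(reference, wrong_location)  # precision 5/6

print(f"recall={r1:.0%} precision={p1:.0%}")
print(f"recall={r2:.0%} precision={p2:.0%}")
```

Note that in the second scenario recall drops too (to 5/7), since the wrong location also costs a correct item.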
In principle, precision and recall are independent, but researchers have discovered an inevitable tradeoff. IE systems obey the old cliche: "you can't have your cake and eat it too". As an IE system tries to produce more detailed summaries (to increase recall), it is more likely to make mistakes (thereby decreasing precision). IE systems must strike a careful balance, being neither too eager nor too cautious.
Nicholas Kushmerick is at: nick@compapp.dcu.ie
For more information, see http://www.compapp.dcu.ie/nick/itr/3.html