Web searchers plotting to soup up their engines

If you think the split-second ability of today's internet search engines to fetch millions of web pages related to your topic is little short of miraculous, you ain't seen nothing yet, according to information retrieval specialists.

Soon the little search box on your search engine of choice may be returning answers to questions, searching television and film by image, performing cross-lingual queries, personalising all your searches by remembering your specific interests, and bringing back results drawn both from information held locally on your PC or network and from information out on the internet.

"Ten years ago, search engines were all about keyword searches, and you used the search box to find a web page," says Prof Alan Smeaton of Dublin City University's Centre for Digital Video Processing, which has been hosting a five-day conference on search technologies this week.

"Now, though the function is the same, the biggest change is what we use web-searching for" - and what we will use it for in future, he says.

Already people use search engines to answer questions, albeit in a clumsy way. Trying to find out which American state Kalamazoo is in? Many people find basic factual information such as this simply by typing the name of the city into a search box, knowing that many of the web pages returned by a search engine will undoubtedly mention the state as well.

And although engines like Google already allow people to do maths problems (try typing "6+5=" into the search box and you'll see), as well as conversions between, say, metric and imperial measurements, Prof Smeaton says that questions typed into the search box may one day bring back answers, not lists of web pages.

Prof Smeaton says most of the research driving such transformations in information retrieval is happening in university research centres rather than in the companies that provide information retrieval tools for the internet, such as Google, Microsoft, Yahoo or Lycos.

"The 'look ahead' of an industrial research centre is shorter," he says, and academic environments can also provide crossdisciplinary expertise.

He also believes the research happening in Europe in information retrieval "is as good or better than what is happening in the Far East or the US". Increasingly, too, he says, academic groups work in partnership with industry towards common goals.

Hence the value of the conference, the European Summer School in Information Retrieval, a biannual event that brings together industry and academic specialists and PhD students researching in the area.

The talk everyone was waiting for was a two-hour presentation by Monika Henzinger, director of research for search giant Google, who is now based in Lausanne.

The title of the talk was "Web Information Retrieval", and Henzinger was the only person during the week who did not make her slides available in advance. This made the talk the subject of much speculation.

It turned out to be a bit of history, a bit of technological analysis and explanation of how search algorithms work, a bit of theorising about how searches can be made more accurate, and a chance for research students to fire some questions about how Google works or might work in future.

Pre-Google, web users relied on search tools that operated using what she describes as "classical retrieval methods": type in some keywords, and the engine returns results based on the frequency with which those terms appear on a given web page. Pages are ranked according to the weight given to each search term (a proper name is given more weight than a title like Mr, for example), and the search algorithm assigns each document a score based on the sum of the weights of the query terms.
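To make the scoring idea concrete, here is a minimal Python sketch of that kind of classical keyword scoring. The function name and the per-term weights are illustrative assumptions; the article does not give the exact weighting scheme Henzinger described.

```python
import re
from collections import Counter

def classical_score(query_terms, document_text, term_weights):
    """Score a document as the weighted sum of query-term frequencies.

    A toy version of 'classical' keyword retrieval: terms judged more
    informative (e.g. a proper name) carry a larger weight than common
    words such as the title 'Mr'.
    """
    counts = Counter(re.findall(r"\w+", document_text.lower()))
    return sum(term_weights.get(term, 1.0) * counts[term]
               for term in query_terms)

# Illustrative, made-up weights: a proper name outweighs a generic title.
weights = {"henzinger": 3.0, "mr": 0.2}
doc = "Mr Smith met Monika Henzinger. Henzinger spoke about web search."
print(classical_score(["henzinger", "mr"], doc, weights))  # 3.0*2 + 0.2*1 = 6.2
```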

However, the frequency of a search term in a document often does not correlate at all with how useful the document is.

In addition, Henzinger noted that such an approach returns thousands of near-duplicate pages, as these are not filtered out of results, and also makes it easy for people to use search-engine spam, "deliberate misinformation where people are trying to manipulate the function of the search engine to get a higher ranking for a web page".

The weakness of classical methods stems from the assumption that all documents on the web are of consistent quality, whereas the web is in fact full of documents of great variety and varying quality, she says.

The Google PageRank algorithm, thought up by Google founders Sergey Brin and Larry Page while graduate students at California's Stanford University, takes a different approach: it ranks pages not just on the appearance of search keywords on the page, but on the authority and usefulness of the document, as determined by how many other websites link to it and whether those linking pages are in turn seen as valuable to other web users.
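The article gives only the intuition. As a rough illustration, the widely published form of PageRank can be sketched as a power iteration that repeatedly shares each page's score out along its outgoing links, with a damping factor; the toy link graph and parameter values below are assumptions for illustration, not Google's actual settings.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Toy power-iteration PageRank.

    `links` maps each page to the list of pages it links to. A page's
    rank is shared equally among the pages it links to; the damping
    factor models a surfer occasionally jumping to a random page.
    """
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outgoing in links.items():
            if not outgoing:  # dangling page: spread its rank evenly
                share = damping * rank[page] / n
                for p in pages:
                    new_rank[p] += share
            else:
                share = damping * rank[page] / len(outgoing)
                for target in outgoing:
                    new_rank[target] += share
        rank = new_rank
    return rank

# A tiny illustrative web: the heavily linked-to page ends up ranked highest.
toy_web = {
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a"],
    "d": ["c"],
}
print(pagerank(toy_web))
```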

The algorithm also enables very fast searches, because the value of pages is already known from previous searches and from analysis of the pages by Google's webcrawlers (programs that automatically visit and analyse web pages to build a database of the web).

Speed is essential to good searches, says Henzinger. Most people don't want to wait for long searches; they only want to look at the top 10 sites returned and won't click through to further pages of results, with 90 per cent viewing only the first results page. "So the goal has to be precision in the top 10 results," she says - those pages must be as valuable as possible to the searcher.
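Precision in the top 10 simply means the fraction of the first 10 returned pages that are actually relevant to the query. A minimal sketch, with hypothetical result and relevance lists:

```python
def precision_at_k(ranked_results, relevant, k=10):
    """Fraction of the top-k returned documents that are relevant."""
    top_k = ranked_results[:k]
    return sum(1 for doc in top_k if doc in relevant) / k

# Hypothetical example: 7 of the first 10 returned pages are relevant.
results = [f"page{i}" for i in range(1, 21)]
relevant = {"page1", "page2", "page3", "page5", "page6", "page8", "page9"}
print(precision_at_k(results, relevant))  # 0.7
```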

And therein lies the essence of how Google works and what researchers at Google are trying to do - make searches ever more precise by figuring out ways to circumvent engine spam (typically, links to advert sites), remove near-duplicate pages (up to 30 per cent of the entire web consists of duplicate pages, she says), and filter results using techniques such as "shingling" (comparing exact sequences of consecutive words rather than just individual words).
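Shingling, as commonly described in the retrieval literature, breaks a document into overlapping runs of consecutive words ("shingles") and compares the resulting sets; pages sharing most of their shingles are treated as near-duplicates. A rough sketch under that assumption - not Google's actual implementation:

```python
import re

def shingles(text, k=4):
    """Return the set of k-word shingles (contiguous word sequences) in text."""
    words = re.findall(r"\w+", text.lower())
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a, b):
    """Overlap between two shingle sets: 1.0 identical, 0.0 nothing shared."""
    return len(a & b) / len(a | b) if a | b else 1.0

page1 = "Google ranks pages by the links pointing at them, not just keywords."
page2 = "Google ranks pages by the links pointing at them, not only by keywords."
print(jaccard(shingles(page1), shingles(page2)))  # substantial overlap -> near-duplicates
```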

Some of the detail thrown out during the talk provided small snapshots of web history - for example, when Google first attempted to rank the top 10 most visited pages on the web in 1999, five of the slots were Microsoft pages, with others including Adobe.com, Netscape.com, and Yahoo.com. In 2005, Microsoft has only one page - a link to its Windows site - in the top 10. The most visited site is extreme-dm.com/tracking, a site for downloading a visitor tracking tool for websites. Second is Google itself, with others including Sitemeter.com (another visitor tracking site), Cyberpatrol.com (a net nanny program), Adobe's site for its Acrobat reader download, and Yahoo.com.

The goal for the future must be increased accuracy and efficiency, Henzinger says, which means experimenting with techniques for filtering and refining results - all of which must happen near-instantaneously and behind the scenes as far as the search engine user is concerned. "Basically, web search engines understand very little - they only look for words," she says.

Maybe so, but the way in which they look for them is becoming one of the tech art forms most appreciated by computer users.

Karlin Lillington

Karlin Lillington, a contributor to The Irish Times, writes about technology