Compiled by PETER LYNCH
TYPE “OLYMPICS” into Google and a billion links appear in a flash. How is this done? How do computer search engines work, and why are they so good?
PageRank (the name is a trademark of Google) is a method of measuring the popularity or importance of web pages. PageRank is a mathematical algorithm, or systematic procedure, at the heart of Google’s search software. Named after Larry Page, a co-founder of Google with Sergey Brin, the PageRank of a web page estimates the probability that a person surfing at random will arrive at that page. Gary Trudeau, of Doonesbury fame, has described it as “the Swiss Army knife of information retrieval”.
At school we solve simple problems such as this: six apples and three pears cost €6; 3 apples and 4 pears cost €5; how much for an apple? This seems remote from practical use, and students may be forgiven for regarding it as pointless. Yet it is a simple example of simultaneous equations, a classical problem in linear algebra, which is at the heart of many modern technological developments. One of the most exciting recent applications is PageRank.
The PageRank computations form an enormous linear algebra problem, like the apples and pears problem but with billions of different kinds of fruit. The array of numbers that arises is called the “Google matrix” and the task is to find a special string of numbers related to it, called the “dominant eigenvector”. The solution can be implemented using a beautifully simple-but-subtle mathematical method that gives the PageRank scores of all the pages on the web.
The web can be represented as a huge network, with web pages indicated by dots and links drawn as lines joining the dots. Brin and Page used hyperlinks between web documents as the basis of PageRank. A link to a page is regarded as an indicator of popularity and importance, with the value of this link increasing with the popularity of the page linking to it. The key idea is that a web page is important if other important pages link to it.
Thus, PageRank is a popularity contest: it assigns a score to each page according to the number of links to that page and the score of each page linking to it. So, it is recursive: the PageRank score depends on PageRank scores of other pages, so it must be calculated by an iterative process, cycling repeatedly through all the pages. At the beginning, all pages are given equal scores. After a few cycles, the scores converge rapidly to fixed values, which are the final PageRank values.
Google’s computers or “googlebots” are ceaselessly crawling the web and calculating the scores for billions of pages. Special programs called spiders are constantly updating indexes of page contents and links.
When you enter a search word, these indexes are used to find the most relevant websites. Since these may number in the billions, they are ranked based on popularity and content. It is this ranking that uses ingenious mathematical techniques.
Efforts to manipulate or distort PageRank are becoming ever more sophisticated, and there is an ongoing cat-and-mouse game between search engine designers and spammers. Google penalises web operators who use schemes designed to artificially inflate their ranking. Thus, PageRank is just one of many factors that determine the search result you see on your screen. Still, it is a key factor, so those techniques you learn in school for the price of apples and pears have a real-world application of great significance and value.
(The answer to the puzzle is: apples cost 60 cents and pears cost 80 cents.)
Peter Lynch is professor of meteorology at the School of Mathematical Sciences, University College Dublin