Lost in Cyberspace; Scientific American Magazine; July 1998; by Hayashi
As any World Wide Web surfer knows, finding information over the Internet can be painfully time-consuming. Search engines such as Yahoo!, AltaVista and Infoseek help, but an improperly honed query can easily result in digital diarrhea: tens of thousands of irrelevant Web pages. A new technique that analyzes how documents posted on the Internet are linked to one another could provide relief. Developed by researchers from IBM, Cornell University and the University of California at Berkeley, the method finds two types of Web sites for a desired subject: "authorities" (pages that are cited by many other documents on that topic) and "hubs" (sites that link to many of those authorities).
The system, dubbed automatic resource compiler (ARC), first performs an ordinary Boolean text-based search (for example, locating documents that contain the words "diamond" and "mineral" but not "baseball") using an engine such as AltaVista. After generating a quick list of about 200 pages, ARC expands that set to include documents linked to and from those 200 pages. The step is repeated to obtain a collection of up to 3,000 locations. ARC then analyzes the interconnections among those documents, essentially giving higher authority scores to pages that are frequently cited, on the assumption that such documents are more useful (just as scientific papers that are referenced by many other articles are deemed the most important). Hubs, in turn, earn high marks for linking to many of those authorities.
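The mutually reinforcing scoring the article describes is the hubs-and-authorities idea from the research literature. The article gives no formulas, so the following Python sketch is only a rough illustration under simplifying assumptions: an unweighted link graph stored as a dictionary mapping each page to the set of pages it links to, and a fixed number of scoring passes; the actual ARC system may weight or filter links in ways the article does not detail.

```python
from collections import defaultdict

def hub_authority_scores(links, iterations=20):
    """Score pages in a link graph.

    `links` maps each page to the set of pages it links to.
    Authorities accumulate scores from the hubs that cite them;
    hubs accumulate scores from the authorities they point to.
    """
    pages = set(links) | {p for targets in links.values() for p in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}

    for _ in range(iterations):
        # A page's authority score sums the hub scores of the pages citing it.
        new_auth = defaultdict(float)
        for source, targets in links.items():
            for target in targets:
                new_auth[target] += hub[source]
        # A page's hub score sums the authority scores of the pages it cites.
        new_hub = defaultdict(float)
        for source, targets in links.items():
            for target in targets:
                new_hub[source] += new_auth[target]
        # Normalize so the scores stay bounded from one pass to the next.
        auth_norm = sum(v * v for v in new_auth.values()) ** 0.5 or 1.0
        hub_norm = sum(v * v for v in new_hub.values()) ** 0.5 or 1.0
        auth = {p: new_auth.get(p, 0.0) / auth_norm for p in pages}
        hub = {p: new_hub.get(p, 0.0) / hub_norm for p in pages}

    return hub, auth

# Hypothetical usage on a tiny crawl: pages with the highest authority
# scores would head the compiled resource list.
links = {
    "hub1": {"authA", "authB"},
    "hub2": {"authA", "authB", "authC"},
    "authA": set(),
}
hub, auth = hub_authority_scores(links)
top_authorities = sorted(auth, key=auth.get, reverse=True)[:5]
```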