Thursday, July 3, 2008

How Software Agents and Search Engines Work

There are at least three elements to search engines that I think are important: information discovery & the database, the user search, and the presentation and ranking of results.

Discovery and Database

A search engine finds information for its database by accepting listings sent in by authors wanting exposure, or by getting the information from their "Web crawlers," "spiders," or "robots," programs that roam the Internet storing links to and information about each page they visit. Web crawler programs are a subset of "software agents," programs with an unusual degree of autonomy which perform tasks for the user. How do these really work? Do they go across the net by IP number one by one? Do they store all or most of everything on the Web?

According to The WWW Robot Page, these agents normally start with a historical list of links, such as server lists, and lists of the most popular or best sites, and follow the links on these pages to find more links to add to the database. This makes most engines, without a doubt, biased toward more popular sites. A Web crawler could send back just the title and URL of each page it visits, or just parse some HTML tags, or it could send back the entire text of each page. Alta Vista is clearly hell-bent on indexing anything and everything, with over 30 million pages indexed (7/96). Excite actually claims more pages. OpenText, on the other hand, indexes the full text of less than a million pages (5/96), but stores many more URLs. Inktomi has implemented HotBot as a distributed computing solution, which they claim can grow with the Web and index it in entirety no matter how many users or how many pages are on the Web. By the way, in case you are worrying about software agents taking over the world, or your Web site, look over the Robot Attack Page. Normally, "good" robots can be excluded by a bit of Exclusion Standard code on your site.

It seems unfair, but developers aren't rewarded much by location services for sending in the URLs of their pages for indexing. The typical time from sending your URL in to getting it into the database seems to be 6-8 weeks. Not only that, but a submission for one of my sites expired very rapidly, no longer appearing in searches after a month or two, apparently because I didn't update it often enough. Most search engines check their databases to see if URLs still exist and to see if they are recently updated.

No comments: