11.0327 altered vistas & eroded feet of clay

Tue, 7 Oct 1997

Humanist Discussion Group, Vol. 11, No. 327.
Date: Tue, 7 Oct 1997
From: mgk3k@faraday.clas.virginia.edu
Subject: altered vistas

I came across a short note from Willard on the AHDS server explaining why he
prefers search engines like Alta Vista to manually maintained links on
gateway pages (like Alan Liu's Voice of the Shuttle): largely because the
overhead on gateway pages is simply too high for them to be reliable or
current. I pointed out in a return message that the large search engines are
not nearly as current or comprehensive as we tend to think, and Willard
asked that I post to Humanist. Here's the message that first alerted me to
this problem, from Phil Agre's inimitable Red Rock Eater list:

>Flawed AltaVista Internet Search Engine
>[It seems to me that the whole keyword-based search engine paradigm on the
>Web collapsed back in the fall sometime. At least that's when I stopped
>being able to find anything on the Web using Lycos, Alta Vista, etc unless
>I had an obviously unique set of words to search on, if then. Now that
>the Web has outgrown indexing and search methods that librarians rejected
>decades ago, maybe it will come time to get some serious ideas about the
>subject. We may even have to listen to the librarians' opinions. Now,
>some people are out there trying to catalog the Web using library cataloging
>principles. But (as the librarians well know) that doesn't work because
>URL's are too impermanent; I've given up trying to cooperate with people who
>think they're cataloging Web-based periodicals such as The Network Observer.
>We need some different metaphors for cataloging and for the Web. Once we
>get over this IPO-driven mania about "push" technology, maybe we can get
>back to business and rethink what it means to order information in a totally
>decentralized environment.]
>Date: Wed, 26 Mar 1997 08:20:28 -0500
>From: John Pike <johnpike@fas.org>
>To: pagre@weber.ucsd.edu
>"As web-surfing enthusiasts already know, AltaVista is a program
>that will search the entire Web..." was the way Amy Schwartz
>introduced a review of the new book "The AltaVista Search
>Revolution" on the oped page of the Washington Post ["The
>Information Laundromat" 22 March 1997].
>While AltaVista is indeed an estimable implementation, most
>web.surfers will be astonished to learn that, contrary to this
>conventional wisdom, AltaVista indexes only a small, flawed,
>arbitrary and not even random sample of what is on the web today.
>Estimates of the total content of the web are of necessity
>speculative, but run as high as 150 million pages. AltaVista
>claims < http://altavista.digital.com/ > to be "the largest Web
>index: 31 million pages found on 476,000 servers." So where are
>the missing pages?? [or as Ronald Reagan asked, "where is
>the rest of me??"].
>There are many reasons a web page might not show up in the
>AltaVista index. Some parts of some sites are hidden from public
>view with the Robots Exclusion Protocol, which tells search
>engines not to index certain pages. Other types of content, such
>as the Adobe Portable Document Format [PDF], do not currently
>support indexing. Some large sites dynamically generate
>their content, rendering it invisible to search engines. And other
>sites have security access controls which may [or may not!!!
>but that is another story.... ] preclude indexing their pages.
>But surely this does not explain why the estimable AltaVista
>indexes only 20% of the web.
>The AltaVista FAQ sez:
>>How do I submit my site to AltaVista?
>>Use our Add URL feature, found at the bottom of every
>>page. Simply type in the main URL for your site. You can
>>submit several URLs, but it is considered bad taste to
>>manually submit your entire site: just let Scooter do this for you.
>This certainly creates the impression that once AltaVista has even
>one URL from a site, it will automatically [in the fullness of time,
>but that is another story as well....] include the entire site in
>its widely used index. Certainly, this claim is the reason that
>AltaVista is so widely relied upon, and the reason that most
>web.users assume that "if it ain't in AltaVista, it ain't online."
>I webmaster the Federation of American Scientists site,
> http://www.fas.org/
>which is a medium-sized website with some 6,000 pages and about 1/2 Gig
>online. Recently I noticed that the Alta Vista search engine seemed to only
>index about 600 of our pages. I thought that this was rather odd, since I had
>long had the impression that AltaVista indexed pretty much everything, or at
>least made a good-faith best effort to do so. I asked them about this, and
>this is what I got back:
>>Date: Tue, 18 Mar 1997 09:08:39 -0800 (PST)
>>From: Alta Vista Support
>>To: johnpike
>>Subject: Re: AltaVista not indexing www.fas.org
>>That is probably a good estimate...We have 600 pages from you indexed in
>>the system. You will probably not see much more than that for any one
>>domain. Geocities has 300...and they have 300,000 members.
>I confess that I was rather horrified as I contemplated the implications of
>this [which can be verified by searching AltaVista on < host:geocities.com >
>.... try this trick on your own domain and see what happens!!!].
>For a medium to large site, such as ours, it means that they are only
>indexing some arbitrarily selected subset of our total content. Thus
>corporations, universities, or most other really content-rich sites will be
>poorly represented in their index.
>It also means that for smaller entities that do not have their own domain,
>their content will also not be indexed. As in, are the reported 300,000 users
>of Geocities aware that the fact that their pages are hosted @
>www.geocities.com [or the larger number of folks who are hosted @
>members.aol.com] means that they are effectively invisible to AltaVista, one
>of the most widely used and admired search engines???
>What this seems to mean is that medium-sized sites of a few hundred
>pages are going to show up nicely in AltaVista, but larger and smaller
>implementations will be nearly invisible, which is a rather odd way of doing
>things. I mean, this is sorta like buying a map that shows some arbitrary
>number of roads but doesn't have any of the main interstates, or a phone
>book that only has even-numbered phone numbers, or something.
>I confess that I was not previously aware of this practice of AltaVista, which
>has certainly not been previously reported anywhere, and is certainly @
>variance with their apparent claims that if you supply them with one URL
>from your site they will spontaneously include the rest of your site in
>This is not to trash AltaVista, which at least has an implementation that
>enables one to determine just how many of your pages are in their index [I
>can't seem to make the other engines do this neat trick]. But it is to say
>that anyone whose online presence has been predicated on their entire site
>[large or small] showing up in AltaVista had better think again. And that
>anyone trying to search the 'entire' web [as opposed to some arbitrary
>sample thereof] had best look somewhere other than AltaVista.
>Frankly, I think this is a more significant story than the widely
>reported "flawed Pentium chip" or "browser security flaws" stories.
>These highly visible episodes affected only a small number of
>users, or were more in the nature of theoretical problems. But
>AltaVista claims to be used nearly 30 million times a day,
>so this "undocumented feature" of AltaVista affects nearly
>everyone who uses the web [doesn't everyone???].
>As someone who uses AltaVista many times a day, and whose
>webpresence strategy had been predicated on "If I build it, they will come,
>cause they will find it in AltaVista" this has really come as a shock to me,
>and I imagine that it would come as a shock to many others as well. I
>mean, it is one thing to admit that regenerating a web.wide index takes a
>long time, and that your index goes stale after a month or so, but it is
>another to admit that you are just not even trying to index large sites, or
>small sites that are appended to an ISP's domain, and I am pretty
>To keep track of this issue Melee's Indexing Coverage Analysis (MICA)
> http://www.melee.com/mica/index.html
>examines the relative page coverage for a select group of search
>engines. Each week, Melee Productions will retest the engines
>on the list and publish an update to the MICA Report. They
>will be happy to test any publicly accessible search engine that
>supports date-range and host/domain constraints, and purports to
>index at least one fifth of the "web".
>Stay tooned for further developments!!!
This went out last March, and in my experience Alta Vista is still at least
six months behind the curve. There are some fascinating issues here,
obviously of relevance to many of us. Anyone have an update?
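
For anyone who wants to check the arithmetic in Pike's message, here is a
minimal Python sketch. The page counts are the ones quoted above (150 million
is, by Pike's own account, a speculative upper estimate; 6,000 and 600 are
his rough figures for www.fas.org). The function name coverage_percent and
the constant names are my own labels, not anything AltaVista or Pike
published.

```python
def coverage_percent(indexed: int, total: int) -> float:
    """Return the indexed share of a collection as a percentage."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * indexed / total


# Figures as quoted in the forwarded message; all are approximate.
WEB_PAGES_ESTIMATE = 150_000_000  # speculative upper estimate of the Web
ALTAVISTA_INDEXED = 31_000_000    # AltaVista's own claim
FAS_PAGES = 6_000                 # "some 6,000 pages" on www.fas.org
FAS_INDEXED = 600                 # "about 600" found via a host: query

web_share = coverage_percent(ALTAVISTA_INDEXED, WEB_PAGES_ESTIMATE)
fas_share = coverage_percent(FAS_INDEXED, FAS_PAGES)

print(f"AltaVista share of the estimated Web: {web_share:.1f}%")
print(f"Share of www.fas.org indexed: {fas_share:.1f}%")
```

On these numbers AltaVista covers roughly a fifth of the estimated Web, which
matches Pike's "only 20%", and about a tenth of his own site.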

Matthew G. Kirschenbaum                  University of Virginia
mgk3k@virginia.edu or mattk@virginia.edu Department of English
http://www.iath.virginia.edu/~mgk3k/     The Blake Archive | IATH

