19.668 amazon.com does text-statistics

From: Humanist Discussion Group (by way of Willard McCarty <willard.mccarty_at_kcl.ac.uk>)
Date: Sun, 19 Mar 2006 08:56:41 +0000

               Humanist Discussion Group, Vol. 19, No. 668.
       Centre for Computing in the Humanities, King's College London
                     Submit to: humanist_at_princeton.edu

         Date: Sun, 19 Mar 2006 08:51:14 +0000
         From: Ken Friedman <ken.friedman_at_bi.no>
         Subject: amazon.com does text-statistics

Ideas & Trends

Book, How Do I Love Thee? Let Me Count the Words

Published: March 19, 2006, New York Times

WHO would compare "The Story of Babar" to the prize-winning novel
"Everything Is Illuminated"? Who would call James Joyce's "Ulysses,"
the bane of many an undergrad, a work for a seventh grader?

With the aid of software at Amazon.com known as Text Stats, anyone
can make such comparisons, which are based on the crudest sort of
computer analysis of a book: how many big words there are, and how
long the sentences run.

Such simple statistical scrutiny has been around for decades - used
to determine a book's appropriateness for a certain grade level,
among other things. But software like Amazon's automates the process,
and the Internet lets anyone see the results.

To what end? ask some literary scholars, who see such techniques as
little more than superficial gimmicks. But others say they are a tool
to gain insight into the authorship of and influences on a text,
whether the work of Bob Dylan, Shakespeare or your average high school student.

When Amazon gets the right from a publisher to let readers "search
inside" a book, Text Stats tallies the average length of a sentence
and amasses little piles for each word used. (Or big piles, as in the
case of the King James Bible, for example, where the count for "loin"
is 1,548; "behold," 1,426; and "lord" 7,082.) The software then ranks
a book for clarity and ease of reading on a variety of indexes.
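
The tallying described above can be sketched in a few lines of Python. This
is only an illustration, not Amazon's implementation, which the article does
not describe: the function names, the regular expressions, and the naive
rule that a sentence ends at ".", "!" or "?" are all assumptions made here.

```python
import re
from collections import Counter


def tokenize_words(text):
    """Return the words of the text in lowercase (letters, optional apostrophe)."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())


def split_sentences(text):
    """Naive split after '.', '!' or '?'; abbreviations such as 'Mr.' will fool it."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def text_stats(text, top_n=100):
    """Tally word counts and the average sentence length in words (hypothetical helper)."""
    words = tokenize_words(text)
    sentences = split_sentences(text)
    counts = Counter(words)  # one "pile" per distinct word
    average = len(words) / len(sentences) if sentences else 0.0
    return {
        "word_count": len(words),
        "sentence_count": len(sentences),
        "avg_words_per_sentence": average,
        "most_common": counts.most_common(top_n),
    }


if __name__ == "__main__":
    sample = "Behold the lord. Behold, the lord is here! Who is here?"
    print(text_stats(sample, top_n=3))
```

A real system would need a more careful sentence splitter and tokenizer,
but the principle of counting words and dividing by sentences is the same.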

For example, "The Story of Babar" has a Flesch-Kincaid Index score of
6.1 (sixth-grade level), the same as "Everything Is Illuminated" by
Jonathan Safran Foer. Their "fogginess" quotients, an index similar
to Flesch-Kincaid, are very close, too, though the Foer book is
slightly less clear - 8 percent of its words are "complex," compared
with 7 percent for "Babar." Text Stats also produces concordances,
lists of the 100 most-used words in a book.
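
For readers curious how such scores are computed, here is a minimal sketch
using the commonly published Flesch-Kincaid grade-level and Gunning fog
formulas. The article does not say that Text Stats uses exactly these
formulas, and the syllable counter below is a rough vowel-group heuristic
assumed for illustration, so the numbers will not match Amazon's.

```python
import re


def count_syllables(word):
    """Rough heuristic: count vowel runs, discounting a trailing silent 'e'."""
    word = word.lower()
    n = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and not word.endswith("le") and n > 1:
        n -= 1
    return max(n, 1)


def readability(text):
    """Return grade-level style scores for a text (hypothetical helper)."""
    words = re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", text)
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    if not words or not sentences:
        raise ValueError("text must contain at least one word and one sentence")

    syllables = [count_syllables(w) for w in words]
    words_per_sentence = len(words) / len(sentences)
    syllables_per_word = sum(syllables) / len(words)
    # "Complex" words are taken here to be those with three or more syllables.
    complex_share = sum(1 for s in syllables if s >= 3) / len(words)

    return {
        # Flesch-Kincaid grade level
        "flesch_kincaid_grade": 0.39 * words_per_sentence
        + 11.8 * syllables_per_word
        - 15.59,
        # Gunning fog index
        "gunning_fog": 0.4 * (words_per_sentence + 100 * complex_share),
        "percent_complex": 100 * complex_share,
    }


if __name__ == "__main__":
    print(readability("The cat sat. The dog ran."))
```

Both scores depend only on sentence length and word length, which is why a
picture book and a literary novel can land on the same grade level.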

It is no surprise that the ratings made by computers, and the
connections between books that they reveal, are often bizarre, since
the software is not concerned with meaning and context and is
unaffected by subjective factors like author reputation.

"It's machine reading; it is the kind of reading no one person can
do," said Ben Marcus, director of the graduate fiction program at
Columbia University and a novelist whose works are not accessible to
Amazon's computers. "I think it is really fascinating, anything that
takes us closer to a text, that makes us aware that it is put
together to create an illusion."

The flaw is obvious, too. "The computer doesn't recognize how
sentences relate to each other," he said. "Gertrude Stein or Beckett
may write in elementary sentences, but they take such huge leaps
between them." But that thickheadedness can be useful, some scholars say.

In "Alice in Wonderland," for example, a statistical study can "place
this text against a large collection of 19th-century fiction to see
which other works it resembles on a stylistic basis - what genre does
it fit best, judging, say, from patterns of use of very common
words?" Hugh Craig, who teaches at the University of Newcastle in
Australia, wrote in an e-mail message. "But it would be essential to
do the reading and analysis in the normal way as well, to see what it
is that makes the patterns."

Richard Abrams of the University of Southern Maine said that he could
get the big picture of a writer from statistical analysis. In
preparing for a seminar on Mr. Dylan's lyrics, he said, he found it
useful to consult a concordance of the 10 most used words in the
lyrics, which included, he said, "babe" and "dark."

"For someone who had Dylan on the brain, there was an absolute sense
of familiarity," he said. "You knew you were looking at a Dylan
favorite word list, it showed Dylan as a Romantic."

Still, statistical analysis like this can bring to mind the reported
critique of Mozart by the Austrian emperor Josef II: "too many notes."

Helen Vendler, the Shakespeare critic at Harvard, had not heard of
Text Stats but speculated that "people will get bored by it -
especially if it insults your intelligence by saying 'Ulysses' is at
seventh-grade level." Likewise, she said a "concordance is not
particularly interesting reading."

Amazon says it likes Text Stats because it keeps readers at the site
longer comparing and contrasting books. "It is definitely a feature
that we view as having a 'sticky' aspect," said Brian Williams, the
senior product manager in charge of the Text Stats functions at
Amazon. Mr. Williams said he had heard complaints about the rating of
"Ulysses" but explained that Text Stats was "just one tool." He said
he had read blog postings from authors discussing their score, always
tongue in cheek. "It should be tongue in cheek," he said.

Received on Sun Mar 19 2006 - 04:12:57 EST