Humanist Discussion Group, Vol. 17, No. 415.
Centre for Computing in the Humanities, King's College London
www.kcl.ac.uk/humanities/cch/humanist/
www.princeton.edu/humanist/
Submit to: humanist@princeton.edu
Date: Mon, 01 Dec 2003 08:06:03 +0000
From: Willard McCarty <willard.mccarty@kcl.ac.uk>
Subject: gender-testing fame: LLC in the Times
Results from work reported in a recent issue of LLC have attracted the
attention of the Times (London) for 22 November, in a review entitled, "A
question of gender: Murder she wrote, or was it he?" The LLC article in
question is, Moshe Koppel, Shlomo Argamon and Anat Rachel Shimoni,
"Automatically Categorizing Written Texts by Author Gender", Literary and
Linguistic Computing 17.4 (2002): 401-12
(http://www3.oup.co.uk/litlin/hdb/Volume_17/Issue_04/170401.sgm.abs.html).
Unfortunately the Times article is no longer online. Some extracts, quoted
under fair use, follow.
The article begins by quoting from the film "As Good as It Gets", in which
the main character Melvin Udall explains his ability to write in a woman's
voice convincingly: "I think of a man and take away reason and
accountability." That surely gets our attention.... The author then points
to "...the single most mysterious, enduring and vexed question of dramatic
writing: how do men write women convincingly? In this case a truly awful
man writing a well-adjusted woman. The same question declares itself no
less energetically the other way round: how do women write men? However,
once we’ve started down the muddy path into this particular valley of
inquiry, we soon discover ourselves mired in deeper and still more menacing
questions . . . Are we, in fact, kidding ourselves about the whole gender
thing? Are writers really capable of genuine sex-change in their fiction,
or has the history of English literature merely been one long exercise in
furtive cross-dressing?"
Are we all now wondering how we can manage to get an equivalent
introduction to the research we do?
Koppel, Argamon and Shimoni 2002 is summarized as follows: "It turns out that
the truth -- the scientific truth -- is that men are capable only of writing
like men; and women only like women." The Times author goes on to explain
that Professor Koppel, along with his colleagues, "has designed a computer
program that is capable of reading any text of more than a thousand words
written in English and telling you the author’s gender. His results, which
have just been published by the Oxford University Press in the academic
journal Literary and Linguistic Computing, are going profoundly to affect
the study of literature around the world. In short, he has used a computer
to prove once and for all that there is a fundamental and recognisable
gender difference in the way we write.
"Such nuances may not be visible to the reader’s eye but his program sees
them sure enough (the accuracy rate is about 83 per cent). Literary
scholars have spent hundreds of years manually sifting texts without
finding the elusive formula that guarantees such consistent success...."
Using texts from the British National Corpus, the author explains, "Bit by
bit, Koppel’s team stripped out all the subject-specific words until the
remaining copy could be fed back into the computer. Then, as they told the
program which text was male and which female, so the team was able to
construct a mathematical set of rules to ascribe to either gender. This
ascribing slowly became describing.
"Koppel again: 'We would look for unique elements for the women’s stuff and
the same for the guys. To give you a simple example, if the computer found
that women used ‘you’ a lot more than the men, then we’d give it a female
weighting. After a while we got down to about 50 reliable distinguishing
features. We did some more programming. We refined the model. Then we tried
it out on anonymous texts. By the end we were hitting 83 per cent accuracy.
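[An aside for readers who want to see the mechanics. The weighting Koppel describes can be sketched as a small linear classifier. What follows is only an illustration under loud assumptions: the four feature words, the two toy training sentences and the use of a plain perceptron are all invented here for the example, and nothing in the Times piece says which learning algorithm the team actually used.]

```python
from collections import Counter

# Hypothetical feature list. The article mentions about 50 features
# but does not enumerate them; these four are illustrative only.
FEATURES = ["you", "she", "the", "a"]

# Invented toy training data: (text, label), +1 = female, -1 = male.
SAMPLES = [
    ("you said she would call you", +1),
    ("the report lists a number of the results", -1),
]

def rates(text):
    """Relative frequency of each feature word among the tokens of text."""
    tokens = text.lower().split()
    counts = Counter(tokens)
    total = max(len(tokens), 1)
    return [counts[w] / total for w in FEATURES]

def train(samples, epochs=20, lr=1.0):
    """Plain perceptron: adjust weights only when a sample is misclassified."""
    weights = [0.0] * len(FEATURES)
    bias = 0.0
    for _ in range(epochs):
        for text, label in samples:
            x = rates(text)
            score = bias + sum(w * xi for w, xi in zip(weights, x))
            if label * score <= 0:
                # Nudge each weight toward the correct label.
                weights = [w + lr * label * xi for w, xi in zip(weights, x)]
                bias += lr * label
    return weights, bias

def predict(model, text):
    """Return 'female' for a positive score, 'male' otherwise."""
    weights, bias = model
    score = bias + sum(w * xi for w, xi in zip(weights, rates(text)))
    return "female" if score > 0 else "male"

model = train(SAMPLES)
```

[On this toy data the weight for "you" ends up positive and the weight for "the" negative, which is the "female weighting" idea in miniature. Two sentences prove nothing, of course; the point is only the shape of the procedure.]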
"'And it’s the function words that give the game away. Not the clever or
the topic-specific stuff but the ‘ands’ and the ‘ifs’ and the ‘buts’, the
least significant parts of sentences. Mainly, we used individual words but
also pairs and triples of consecutive parts of speech. But this program is
not about grammar. Actually, the single biggest difference is that women
are far more likely than men to use personal pronouns -- ‘I’, ‘you’, ‘she’,
‘myself’, or ‘yourself’. Men, on the other hand, are more likely to use
determiners -- ‘a’, ‘the’, ‘that’, and ‘these’ -- as well as numbers and
quantifiers like ‘more’ or ‘some’.
"'And though it might feel a little eerie that people give away their
gender like this, it’s actually kind of obvious -- when we write we pay
attention to the big words, not the little ones.'"
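[Another aside. The function-word counts Koppel mentions are easy to try for oneself. The sketch below counts only the pronouns, determiners and quantifiers named in the quotation; the function names and the per-thousand scaling are assumptions made for illustration, and the part-of-speech pairs and triples are left out because they would require a tagger.]

```python
import re

# Word lists taken from the examples in the quotation; the real
# feature set is not given there and was presumably larger.
CLASSES = {
    "pronouns": {"i", "you", "she", "myself", "yourself"},
    "determiners": {"a", "the", "that", "these"},
    "quantifiers": {"more", "some"},
}

def tokenize(text):
    """Lower-case the text and split it into runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())

def class_counts(text):
    """Count how many tokens fall into each word class."""
    tokens = tokenize(text)
    return {name: sum(t in words for t in tokens)
            for name, words in CLASSES.items()}

def class_rates(text):
    """Occurrences of each word class per 1000 tokens."""
    total = len(tokenize(text)) or 1
    return {name: 1000 * n / total
            for name, n in class_counts(text).items()}
```

[For example, class_counts("I told you that the results need more work") finds two pronouns, two determiners and one quantifier. Comparing such rates across a few long texts gives a rough feel for the contrast Koppel describes, though not for his program's accuracy.]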
"But who, I wondered, was among the mistaken 17 per cent? Koppel had kept a
list. P. D. James was the first name that caught my eye. After checking her
novel Devices and Desires, the computer concluded that Baroness James was a
man. Likewise wrongly sexed was David Lodge’s early novel, The
Picturegoers. But the one that really caught my eye was Dick Francis."
The reporter interviews Baroness James, then David Lodge, who is quoted as
saying, "Novels are very problematic texts because they are written in a
medley of styles. And more often than not the author is trying to imitate
some kind of imagined consciousness -- male or female. Indeed, writers have
always tried to imitate the distinctive characteristics of male and female
discourse and we are in the habit of thinking that they have often
succeeded. But perhaps these scientists believe they can prove this is an
illusion. Still, I’m very surprised that this program is able to discern
the gender of the real author. If you were to take ordinary first-person
texts -- letters or diaries -- then you might, of course, expect a fairly
high degree of accuracy. But that it can be done on literary novels
intrigues me. This will have fascinating literary, critical and general
sociological implications. That said, I’d like to see them apply it to a
novelist’s attempt to imitate the opposite sex in a particular passage."
The reporter goes back to Koppel -- to discover "the program’s Achilles’
heel. When writers move into direct speech they can imitate the voice of
the opposite gender much more successfully, it seemed. Not always, but
often. This was something: a sort of 50 per cent rescue for novelists. But
what of the long passages of prose where the novelist is pretending to be
inside the mind of a character? What about, for example, the last chapter
of Ulysses wherein James Joyce spends 20,000 words pretending to be the
untidy consciousness of Molly Bloom -- changing direction, interrupting,
digressing?
"On this, Koppel demurred. He hadn’t, he pointed out, tested the program in
such a literary-specific way. In fact, he went on to explain, the whole
text-recognition business is ultimately about the internet -- that’s the
impetus behind the work and the most obvious commercial application:
refining search-engine accuracy, recognising disguised chatroom entrants
and so on. But, he agreed, it remained rather important for English
literature that such specifics were tested."
Dr Willard McCarty | Senior Lecturer | Centre for Computing in the
Humanities | King's College London | Strand | London WC2R 2LS || +44 (0)20
7848-2784 fax: -2980 || willard.mccarty@kcl.ac.uk
www.kcl.ac.uk/humanities/cch/wlm/