12.0177 response to The Tagging Challenge

Humanist Discussion Group (humanist@kcl.ac.uk)
Wed, 26 Aug 1998 12:44:08 +0100 (BST)

Humanist Discussion Group, Vol. 12, No. 177.
Centre for Computing in the Humanities, King's College London

Date: Tue, 25 Aug 1998 17:18:02 +0100 (BST)
From: Michael Popham <michael@ermine.ox.ac.uk>
Subject: Re: 12.0169 The Tagging Challenge: an update

I thought I'd respond to Stuart Lee's recent challenge(s) via the list,
despite the fact that our respective desks are only about 20 metres apart!
Readers are warned that this response is somewhat(?) *L*O*N*G*, and will
probably be of interest only to those concerned with the debate about the
merits (or otherwise) of using SGML/TEI encoding.

On Mon, 17 Aug 1998, Stuart wrote:

> Is it that certain documents are simply too complicated for SGML-based
> text encoding? If so, at what point does a document cross this boundary?
> I suppose an answer to this might be 'a document is only as
> complicated as you make it'.

I think Stuart has correctly answered his own question -- although I'd
want to put more emphasis on what seems to me to be the essential issue
which underlies his response: namely, that in the process of making *any*
representation of a pre-existing object, the creator of that
representation is obliged to make certain choices (and, therefore,
compromises). Producing a representation by SGML/TEI encoding is certainly
no worse -- and, for some purposes I would argue it's much better -- than
the alternatives currently available (e.g. scanning to produce a page
image). As long as those choices/compromises are documented and/or known
to anyone who subsequently makes use of that representation, then I don't
see a problem.

I've included my quick and dirty efforts to answer Stuart's challenge at
the end of this message, for the "amusement" of whoever cares to look.
*However* before anyone starts scrolling ahead, I'd like to make a
number of points.

I began by using the TEI's Pizza Chef to build a DTD which would support
all the features I thought I would want to mark up. I used Emacs+psgml to
parse and validate the DTD, and then produced a compiled version which I
could use with Emacs to edit the text of the poem. I freely admit that due
to my ignorance of Emacs+psgml, I ended up having to cut and paste the DTD
into the start of my file in order to get the thing to parse and validate
without errors. Total cost of software = $0.00 (or 0.00 ecu if you

In his second message, Stuart summarized his challenge in the following

> So, to clarify, if I wished to record all of the alterations by the
> poet in machine-readable (AND machine-searchable) form what should I do?

.....which made me realize that my DTD was overly sophisticated, and I
could have got away with using just the additional tags for the
transcription of primary sources (but I was too lazy to "rebake" my DTD).
As Stuart wished to record alterations, I felt I could do this primarily
by using just four tags: <add> and <del> (unsurprisingly, for additions
and deletions), plus <addSpan> and <delSpan> -- which would allow me to
encode longer alterations which spanned several lines of text.

How I "cheated"
In his original challenge, Stuart included "the text of the poem" (as
published in Stallworthy, 1983), the URL for a scanned image of the MS,
and a bibliographic reference to Stallworthy's transcription of the MS.
Having the benefit of geography on my side, I was able to go around to his
desk, borrow a copy of Stallworthy's transcription, and use that to inform
my encoding of the MS (I suspect that no-one else who has attempted this
challenge has consulted Stallworthy's transcription -- and in not doing so,
has made a great deal more work for themselves, and ignored a
pre-existing and relevant intellectual asset). That said, and copyright
issues aside, I think Stuart should have mounted Stallworthy's
transcription on his website as well as the scanned image of the MS.

By obtaining Stallworthy's transcription, I was effectively stealing the
results of his (Stallworthy's) "considerable intellectual exercise" --
noting that it was the prospects of having to do such (not inconsiderable)
work for themselves, according to his original challenge, which had so
perturbed Stuart and Paul that it had caused them to question the
possibility and worth of using SGML encoding to transcribe such a
document. For the purposes of this challenge, I didn't see the point in
trying to transcribe the MS again from scratch (i.e. solely from the
scanned image of the MS), particularly when an eminent Owen scholar such
as Stallworthy had already done the work of identifying the alterations
made to the text, what words had been added/deleted, and so on.

Thus in the process of encoding "the text", I was continually switching
between the MS image that Stuart had supplied, and the printed text of
Stallworthy's transcription. If you've read this far, then it will no
doubt be obvious to you that Stallworthy's transcription (dare I say
"encoding"), whilst it did an excellent job of recording many features of
the MS (especially the poet's alterations to the text), was clearly the
product of certain implicit editorial choices (i.e. particular features of
the MS had not been transcribed, such as the fact that it was written in
blue pencil, the relative weights and exact positions of some of the
alterations, the exact path of crossings out, the fact that some words
have been struck out horizontally, diagonally, scribbled out, etc. etc.).

Therefore, my encoding of the MS was both informed and guided by the
"considerable intellectual exercise" that Stallworthy had invested in his
transcription -- and I did not feel that it would be appropriate for me to
question his decisions about the alterations made to the text. The only
exception to this was the phrase "life-warm", which occurs 2/3 of the way
through the text of the MS -- which Stallworthy has transcribed as
unaltered, although it looks to me (using the scanned image of the MS),
as if it has been faintly struck out.

My SGML/TEI encoded transcription of the MS is intended to meet Stuart's
request that the results of any encoding should be "in machine-readable
(AND machine-searchable) form" ......which is another way of my saying
that it's not very pretty to look at, nor is it supposed to be. SGML
encoding is meant to facilitate information interchange between different
machines and applications over time -- but I'd still want to point out
that a human being is likely to be able to get more out of looking at a
"raw" SGML file than, say, a raw JPEG file containing an image of the same

Although I tried to use the attribute values of the <add> and <addSpan>
tags to give some indication of where on the page such additional text was
placed, I never intended that any piece of software should be able to
recreate a "reasonable" representation of the MS, not even to the standard
of Stallworthy's transcription. Besides, Stuart has already produced an
excellent image of the relevant MS, and accurate reproduction of the
visual appearance of the document wasn't the purpose of this exercise.

Instead, what I wanted to transcribe was the apparent nature of an
alteration (drawing on Stallworthy's interpretation) e.g. whether a piece
of text had been added and then deleted, which groups of words/lines
appear to have been added/deleted together, and so on. Despite the fact
that the resulting (*machine-readable*) text is peppered with tags, I
believe that there's enough encoding present to allow suitable software to
answer such questions as: "Where does the author delete a word, then
replace it with the same word?", "Are there more supralinear additions
than sublinear additions?", "How many words of the text have not been
altered?", "What's the ratio of added text to original text?" etc.
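
As a rough illustration of the kind of "suitable software" meant here, the
sketch below tallies additions by their place attribute and counts deleted
versus surviving words. It is only a sketch under stated assumptions: the
class name AlterationCounter is hypothetical, Python's lenient html.parser
stands in for a real SGML parser, and the sample is a single line taken from
the transcription at the end of this message.

```python
from collections import Counter
from html.parser import HTMLParser


class AlterationCounter(HTMLParser):
    """Tally <add> elements by place and count deleted vs. surviving words.

    Hypothetical helper: html.parser is used as a stand-in for an SGML parser.
    """

    def __init__(self):
        super().__init__()
        self.add_places = Counter()  # e.g. supralinear, sublinear
        self.del_depth = 0           # > 0 while inside one or more <del> elements
        self.deleted_words = 0
        self.kept_words = 0

    def handle_starttag(self, tag, attrs):
        if tag == "add":
            self.add_places[dict(attrs).get("place", "unspecified")] += 1
        elif tag == "del":
            self.del_depth += 1

    def handle_endtag(self, tag):
        if tag == "del" and self.del_depth > 0:
            self.del_depth -= 1

    def handle_data(self, data):
        words = len(data.split())
        if self.del_depth:
            self.deleted_words += words
        else:
            self.kept_words += words


# One line of the transcription, used here as sample input.
SAMPLE = (
    '<l><del>At home whispering of</del>'
    '<add place="supralinear"><del>Easily called him to</del></add>'
    '<add place="sublinear">Called him out to</add> fields of half-sown.</l>'
)

counter = AlterationCounter()
counter.feed(SAMPLE)
counter.close()
print(dict(counter.add_places), counter.deleted_words, counter.kept_words)
```

Comparing counter.add_places["supralinear"] with
counter.add_places["sublinear"] addresses the second sample question, and the
two word counts give a crude ratio of deleted to surviving text.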

Whilst none of these questions might be particularly interesting in the
case of a single MS, I can imagine that if all the existing MSs of Owen's
work were similarly transcribed (perhaps drawing on Stallworthy's "Wilfred
Owen: The Complete Poems and Fragments"?), it would be possible to
identify which of the MSs had been most heavily revised, and perhaps tie
this in with known patterns of creative energy or stress in Owen's life --
or search the corpus of MSs with questions such as "Did Owen ever use, then
delete, the phrase 'glorious war' in any of his poems?".
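
A corpus query of that last kind might be sketched as follows. Everything
here is an assumption made for illustration: the names DeletedText and
manuscripts_deleting are hypothetical, the two-entry "corpus" reuses lines
from the transcription at the end of this message under made-up labels, and
html.parser again stands in for a real SGML parser.

```python
from html.parser import HTMLParser


class DeletedText(HTMLParser):
    """Collect the text inside each outermost <del> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.depth = 0       # nesting depth of <del> elements
        self.buffer = []     # text seen inside the current outermost <del>
        self.passages = []   # one whitespace-normalised string per outermost <del>

    def handle_starttag(self, tag, attrs):
        if tag == "del":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "del" and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                self.passages.append(" ".join("".join(self.buffer).split()))
                self.buffer = []

    def handle_data(self, data):
        if self.depth:
            self.buffer.append(data)


def manuscripts_deleting(corpus, phrase):
    """Return the labels of transcriptions whose deleted text contains phrase."""
    needle = phrase.lower()
    hits = []
    for label, sgml in corpus.items():
        parser = DeletedText()
        parser.feed(sgml)
        parser.close()
        if any(needle in passage.lower() for passage in parser.passages):
            hits.append(label)
    return hits


# Made-up labels; the marked-up lines come from the transcription below.
CORPUS = {
    "ms-a": '<l><del>At home whispering of</del>'
            '<add place="sublinear">Called him out to</add> fields of half-sown.</l>',
    "ms-b": "<l>Always it woke him, even in France,</l>",
}

print(manuscripts_deleting(CORPUS, "whispering"))
```

Because only text inside <del> is searched, a phrase that survives in the
final text (or appears only in an undeleted addition) is not reported.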

I have absolutely no idea whether or not these are the kinds of questions
that Owen scholars do ask, or may ever want to ask -- but having access to
well-encoded machine-readable transcriptions of his work would make such
things possible.

Building the DTD and producing an SGML/TEI encoded transcription of the MS
took only slightly longer than it took to produce this email! (but maybe
I'm just a slow typist).

I don't make any claims for the quality of my SGML/TEI encoded
transcription, other than the fact that I believe it addresses the
essential point of Stuart's challenge. I'm sure there are many vastly
superior ways of producing much more useful, machine-readable
transcriptions using SGML and the TEI scheme in particular -- but I leave
that to others more knowledgeable.

If I had been working *solely* from the scanned image of the MS then (a)
things would have been a lot more difficult because I wouldn't have been
able to draw on the expertise and knowledge of Owen's work that
Stallworthy brought to his transcription; (b) despite the image being of
reasonable quality, the resolution wasn't sufficient to resolve certain
textual issues which I was only able to solve by turning to Stallworthy's

However unpleasant the appearance of the raw, machine-readable SGML/TEI
transcription of the MS that I have produced, it is machine-searchable and
thus offers functionality which is not available through either the
scanned image or, indeed, Stallworthy's conventional print-based
transcription (though of course with a certain amount of effort, it would
be possible for a researcher to answer the sample questions I posed above
by working from high-quality print-based transcriptions).

The SGML/TEI transcription contains some basic (okay, *very* basic) header
information. This forms an essential part of the transcription, and
without it, I would not have produced a conforming TEI document. The
scanned image contains no equivalent metadata -- and although it could be
attached in some fashion, there would always be the danger that the two
may become separated.

The resulting machine-readable/searchable SGML/TEI transcription,
including the DOCTYPE declaration, is only 4KB in size, whilst the
mid-resolution non-machine-readable/searchable JPEG scanned image weighs
in at 174KB. Even so, the two items could be made available electronically
in a package less than 200KB in size ..... which represents a lot of
intellectual content when compared to the size of some of the junk flying
around the net. Moreover, by virtue of this posting to HUMANIST, more
copies of this electronic transcription of MS 43720 probably now exist in
the world, than do copies of Stallworthy's printed transcription (although
I wouldn't necessarily argue that this is a good thing!).

If Stuart believes that my SGML/TEI encoding of the text has not captured
"...all of the alterations by the poet", then I suspect that the fault
lies with me and my ignorance of the TEI, rather than an inherent failing
of either TEI or SGML. Even if it were to be shown that the TEI is *not*
capable of capturing the information that Stuart desires, that only
suggests to me that the TEI needs to be revised or extended -- rather than
junked wholesale. Furthermore, if the TEI could not be so adapted, then
this would still not have shown the limitations of using an SGML-based
markup scheme, but would simply imply that an alternative markup scheme
should be devised. Whether using the existing TEI scheme, adapting or
extending it, or replacing it with anything else might make the work any
less "difficult", seems to me to be an issue best left to the individual
encoder/transcriber to judge.

An SGML/TEI Transcription of the MS
Whatever the merits of my response to Stuart's challenge, I wouldn't want
to pretend that there is a definitive answer to his suggestion that there
may be material that is "...too difficult to mark-up using the TEI

I'm sure there are TEI zealots who would happily tackle the encoding of
any document you'd care to throw at them (which *isn't* an open
invitation!) -- but I have a sneaking suspicion that the terribly dull
answer is that it will all come down to a straightforward cost/benefit
analysis, and that the "cost" (difficulty expressed in terms of time,
money, learning curve etc.) must be weighed against the anticipated
benefits (machine readability/searchability, non-proprietary encoding,
capturing an interpretative encoding of a text, drawing on the work of
TEI etc. etc.).

So, for those of you that have read this far (or simply jumped ahead from
the start of this message!), here's my attempt at producing one possible
transcription of the MS. I don't claim that it's particularly good or
pretty, but I hope it shows the general principle that (in this case at
least), it's possible to produce a machine-readable/searchable SGML/TEI
transcription of the MS which captures all the alterations made to the
text. (NB. I don't think it would be useful to get into a discussion about
how good, bad, or indifferent my encoding may be -- but I would be happy
to see people continuing this thread in general terms, responding to
Stuart's challenge).

I apologize in advance for any errors which this transcription contains,
but I hope I've done enough to demonstrate the principles, answer Stuart's
challenge.....and win the free book!

<!DOCTYPE tei.2 PUBLIC "-//TEI P3//DTD Main DTD Driver File 1994-05//EN" [
<!--* Base tag set *-->
<!ENTITY % TEI.prose "INCLUDE" >

<!--* Additional tag sets *-->
<!ENTITY % TEI.linking "INCLUDE" >
<!ENTITY % TEI.analysis "INCLUDE" >
<!ENTITY % TEI.certainty "INCLUDE" >
<!ENTITY % TEI.transcr "INCLUDE" >
<!ENTITY % TEI.textcrit "INCLUDE" >
<!ENTITY % TEI.figures "INCLUDE" >

<!--* ISO Entity sets *-->
<!ENTITY % isolat1 PUBLIC "ISO 8879:1986//ENTITIES Added Latin 1//EN" >

<title>Futility MS 43720 f53.v as appearing in (<title>Wilfred Owen: The
Complete Poems and Fragments</title>)</title><title type="subtitle">A
machine-readable transcription</title>
<author>Owen, Wilfred 1893-1918</author>
<editor>Stallworthy, Jon</editor>
<authority>Wilfred Owen Estate</authority>
<distributor>Dr Stuart Lee</distributor>
<bibl><author>Owen, Wilfred 1893-1918</author><title>Futility</title></bibl>
<head rend="underline">Futility</head>
<lg type="paragraph">
<l><delspan type="strikethrough" to="a01">Move him into the sun. -<add
place="right"><del type="strikethrough">and his brow's snow<add
place="sublinear"><del>The snow will melt soon</del></add></del></add></l>
<l>Gently its touch awoke him once</l>
<l><del>At home whispering of</del><add place="supralinear"><del>Easily
called him to</del></add><add place="sublinear">Called him out to</add>
fields of half-sown.</l>
<l>Always it woke him, even in France,</l>
<l>Until this morning and this snow.</l>
<l>If anything might rouse him now</l>
<l>The kind old sun will know.</l>
<l>Think how it wakes the seeds, -</l>
<l>Woke, once, the clays of a cold star.</l>
<l><delspan type="overstrike" to="a02"><add
place="supralinear">Are</add><del>Are</del>limbs, perfect <add
place="supralinear"><del>almost</del></add><del>at last</del>, and sides
<l><del>Almost life-warm</del><add place="supralinear"><del>And heart still
warm</del></add><del>too hard to</del><add place="supralinear"><del>it
cannot</del></add><add place="sublinear">too
hard</add>stir<del>?</del>.<anchor id="a02"></l>
<lg><l><addspan place="bottom" to="a03">Are limbs - <del>pricked</del><add
place="supralinear">bled</add> with a little sword</l>
<l>Yet limbs - still warm - too hard to stir?</l>
<lg><l><addspan place="bottom" to="a04">Are limbs, <del>so</del> ready
for life, full grown,</l>
<l>Nerved and still warm, too hard to stir?<anchor id="a04"><anchor
<l>Full-nerved, still warm</l>
<l>Was it for this the clay grew tall?</l>
<l><add place="left">O</add><del>O</del>What<del>fatuity</del> made
<del>the</del> <add place="supralinear">fatuous</add>sun <del>toil</del><add
place="right">beams toil</add></l>
<l>To break earth's sleep at all?<anchor id="a01"></l>

Michael Popham - Head of the Oxford Text Archive
Oxford University Computing Services
13 Banbury Road, Oxford, OX2 6NN, United Kingdom
TEL: +44-(0)1865-273238 FAX: +44-(0)1865-273275
WEB: http://ota.ahds.ac.uk/

Humanist Discussion Group
Information at <http://www.kcl.ac.uk/humanities/cch/humanist/>