
Humanist Discussion Group



Humanist Archives: Feb. 5, 2019, 6:33 a.m. Humanist 32.423 - the McGann-Renear debate

                  Humanist Discussion Group, Vol. 32, No. 423.
            Department of Digital Humanities, King's College London
                   Hosted by King's Digital Lab
                       www.dhhumanist.org
                Submit to: humanist@dhhumanist.org


    [1]    From: Henry Schaffer 
           Subject: Re: [Humanist] 32.417: the McGann-Renear debate (154)

    [2]    From: Gabriel Egan 
           Subject: Re: [Humanist] 32.417: the McGann-Renear debate (52)

    [3]    From: Desmond Schmidt 
           Subject: Re: [Humanist] 32.416: the McGann-Renear debate (189)


--[1]------------------------------------------------------------------------
        Date: 2019-02-04 14:31:14+00:00
        From: Henry Schaffer 
        Subject: Re: [Humanist] 32.417: the McGann-Renear debate

On Mon, Feb 4, 2019 at 2:55 AM Humanist  wrote:

>                   Humanist Discussion Group, Vol. 32, No. 417.
>             Department of Digital Humanities, King's College London
>                    Hosted by King's Digital Lab
>                        www.dhhumanist.org
>                 Submit to: humanist@dhhumanist.org
>
>
>     [1]    From: William Pascoe 
>            Subject: Re: [Humanist] 32.416: the McGann-Renear debate (71)
>
>     [2]    From: Dr. Herbert Wender 
>            Subject: Re: [Humanist] 32.416: the McGann-Renear debate (11)
>
>
>
> --[1]------------------------------------------------------------------------
>         Date: 2019-02-04 00:47:04+00:00
>         From: William Pascoe 
>         Subject: Re: [Humanist] 32.416: the McGann-Renear debate
>
> Hi,
>
>
> Just thinking about how a practical alternative technology to XML for
> marking up texts based on this discussion might work, where text = a
> linear series of discrete characters (since there are many such things
> and it's useful to find general ways to work with them).
>
> One issue is that there are many hierarchies applicable to any text
> depending on what someone is interested in. It's not realistic to apply
> all this markup to a single text file, so how would you overlay all
> these different markups?
>
> Another is that start and end points of features of interest sometimes
> overlap, which is a problem for the strictly nested hierarchy required
> in XML.
>

  This is IMHO a very important point. If our markup definitions don't
reflect our texts, which has to give way? Here's a string of characters,
bbbbbbiiiiii, where the six b's are bold and the six i's are italic, so
bold and italic don't overlap and there's no problem with overlapping tags.
But what about bbbbbbiiiiii where there is overlap (if the fonts don't come
through: 3 bold b's, 3 bold italic b's, 3 bold italic i's and 3 italic i's),
which is well represented by the overlapping tags:
<b>bbb<i>bbbiii</b>iii</i>

> Why not just leave the text file alone, not put the tags in the text file
> itself, and specify markup in a different file that has pointers to the
> start and end characters?

  From my vantage point, there really isn't a difference. The two formats
are "isomorphic" in the mathematical sense. We accept this without noticing it 
with respect to all computer storage of text. The letter "A" isn't an "A" in 
computer storage, it's a group of bits (0 or 1) in a cluster (perhaps 01000001)
which we all agree stands for an "A" and when we tell the computer to display 
it on the screen it shows up as "A".

  Similarly, both <b>bbb<i>bbbiii</b>iii</i> in a single file and something
like bbbbbbiiiiii in one file with bold[0:8] italic[3:11] in another file
represent exactly the same text and are presented with identical appearance
if the screen allows the three fonts bold-non-italic, bold-italic, and
non-bold-italic.
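
A minimal Python sketch of this equivalence, under stated assumptions: the
offsets are read as inclusive (so bold[0:8] covers nine characters), and the
function names are invented for illustration, not taken from any existing
tool. It reduces both representations to a per-character list of active
styles and compares them.

```python
import re

TEXT = "bbbbbbiiiiii"
# Inclusive start/end offsets, following bold[0:8] italic[3:11] in the message.
STANDOFF = {"b": (0, 8), "i": (3, 11)}
# The one-file form, with tags that overlap rather than nest.
INLINE = "<b>bbb<i>bbbiii</b>iii</i>"

def styles_from_standoff(text, ranges):
    """Pair each character with the set of tags whose range covers it."""
    return [
        (ch, frozenset(t for t, (start, end) in ranges.items() if start <= pos <= end))
        for pos, ch in enumerate(text)
    ]

def styles_from_inline(marked):
    """Read a tagged string left to right; tags may overlap, nesting is not enforced."""
    active, out = set(), []
    for token in re.findall(r"</?\w+>|.", marked):
        if token.startswith("</"):
            active.discard(token[2:-1])      # closing tag switches a style off
        elif token.startswith("<") and len(token) > 1:
            active.add(token[1:-1])          # opening tag switches a style on
        else:
            out.append((token, frozenset(active)))
    return out

same = styles_from_inline(INLINE) == styles_from_standoff(TEXT, STANDOFF)
print(same)  # True
```

The comparison succeeds because neither reader cares how the styles were
stored, only which styles are active at each character.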

> The strictness of the hierarchy would be optional, so for those purposes
> that things like DTDs are useful for, you could still have that (such as
> act, scene, speaking character or POS tagging), but you could also have
> complete looseness, such as annotations on overlapping text segments.

  Yes, it can be done in two files. But it can be done in one file - unless
we have a "rule" that tags are not allowed to overlap. But if we are to honor 
that rule, then surely we need to honor it in the two-file storage method.

  So the problem resolves to the rule of non-overlapping, not to the method
of storage and representation. So we come down to the question of whether we 
continue to use a method if it doesn't comport well with reality.

  However, non-overlapping may not be a problem with respect to
representation, but more one of awkwardness. Here are two tagged
representations:
<b>bbb<i>bbbiii</b>iii</i> (overlap)
<b>bbb<i>bbbiii</i></b><i>iii</i> (non-overlap)
which produce/represent identical texts. So, should we even care which is
used? Similarly, each would produce a different second-file representation
(i.e. tags with pointers), but again, they would produce/represent
identical texts.
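
As a hedged illustration, the sketch below derives a properly nested form
from the pointer form by closing and reopening tags at every boundary. It
yields one of several possible non-overlapping encodings, not necessarily
the one written out by hand. The function names are hypothetical, offsets
are taken as inclusive, and the text is assumed to contain no characters
that need XML escaping.

```python
import xml.etree.ElementTree as ET
from itertools import groupby

def nested_markup(text, ranges):
    """Emit well-formed markup by splitting the text wherever the tag set changes."""
    def tags_at(pos):
        return tuple(sorted(t for t, (s, e) in ranges.items() if s <= pos <= e))

    out = []
    for tags, run in groupby(range(len(text)), key=tags_at):
        chunk = "".join(text[p] for p in run)
        opening = "".join(f"<{t}>" for t in tags)
        closing = "".join(f"</{t}>" for t in reversed(tags))
        out.append(opening + chunk + closing)
    return "".join(out)

def is_well_formed(fragment):
    """True if the fragment parses as XML once wrapped in a root element."""
    try:
        ET.fromstring(f"<root>{fragment}</root>")
        return True
    except ET.ParseError:
        return False

nested = nested_markup("bbbbbbiiiiii", {"b": (0, 8), "i": (3, 11)})
print(nested)                                        # <b>bbb</b><b><i>bbbiii</i></b><i>iii</i>
print(is_well_formed(nested))                        # True
print(is_well_formed("<b>bbb<i>bbbiii</b>iii</i>"))  # False
```

An XML parser accepts the split form and rejects the overlapping one, even
though both describe the same styled characters.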

> This external markup file could still even be XML itself, with a lot of
> tags and values that point to start and end points, and its own DTDs for
> problem domains, as normal. Instead of text in it, there are just the
> pointers. It would be easy enough to automatically convert a text marked
> up with XML into one of these external ones, just by stripping the text
> and putting a 'start' and 'end' attribute on every tag.

  Bill, I think I've said above what you are saying, but taken a lot more
space to spell it out with examples!
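
The conversion described in the quoted paragraph could be sketched roughly
as follows. This is only an illustration under assumptions: the input is
well-formed XML, attributes are ignored, offsets are half-open (so
text[start:end] is an element's content), and the `speech`/`speaker`
element names and the `to_standoff` helper are invented for the example.

```python
import xml.etree.ElementTree as ET

def to_standoff(xml_source):
    """Split well-formed XML into plain text plus (tag, start, end) records."""
    pieces, spans = [], []
    length = 0

    def walk(elem):
        nonlocal length
        start = length
        if elem.text:                      # text before the first child
            pieces.append(elem.text)
            length += len(elem.text)
        for child in elem:
            walk(child)
            if child.tail:                 # text following the child's end tag
                pieces.append(child.tail)
                length += len(child.tail)
        spans.append((elem.tag, start, length))

    walk(ET.fromstring(xml_source))
    return "".join(pieces), spans

source = "<speech><speaker>Horatio</speaker> Friends to this ground.</speech>"
text, spans = to_standoff(source)
print(text)    # Horatio Friends to this ground.
print(spans)   # [('speaker', 0, 7), ('speech', 0, 31)]
```

Several such span lists, produced by different people, could then be laid
over the same plain text without touching it.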

> That way an interface could slurp in as many of these and highlight/colour
> code etc a text with as many different markup files as you wanted to
> import, if you wanted to visualise it, or also, to process it, with these
> different ways of looking at it from different people's work.
>
> I'd be surprised if someone hasn't already come up with this approach.
>
> A problem with this approach is that the text of interest would need to be
> static, because you make one change and all the pointers point to the wrong
> address, but in many useful cases there are static texts for which this
> would be useful. Though some version control system could be brought to
> bear.

  Here's where the isomorphism becomes useful. If you make one change in the
visual one-file representation, it is automatically changed in the storage
two-file representation. That's what happens today in editing - one 
inserts/deletes one character in a file, and we don't even think about it, 
but the computer has to shift all the succeeding characters in the file and does 
it so successfully in the background that we're oblivious to the work needed.
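
A small sketch of what that background bookkeeping might look like for the
pointer file, assuming half-open (tag, start, end) spans. The policy chosen
here (text inserted exactly at a span's start or end stays outside the span)
is one assumption among several reasonable ones, and `insert_text` is a
hypothetical helper.

```python
def insert_text(text, spans, pos, new):
    """Insert `new` at offset `pos` and shift the spans so they still fit."""
    n = len(new)
    shifted = []
    for tag, start, end in spans:
        if start >= pos:        # span lies at or after the insertion point
            start, end = start + n, end + n
        elif end > pos:         # insertion falls strictly inside the span
            end += n
        shifted.append((tag, start, end))
    return text[:pos] + new + text[pos:], shifted

# Half-open equivalents of bold[0:8] and italic[3:11].
spans = [("b", 0, 9), ("i", 3, 12)]
new_text, new_spans = insert_text("bbbbbbiiiiii", spans, 3, "X")
print(new_text)    # bbbXbbbiiiiii
print(new_spans)   # [('b', 0, 10), ('i', 4, 13)]
```

The italic span still covers the same nine characters after the edit; only
its addresses have moved.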

> So for example, this would be useful for the set of Shakespeare plays.
> Canonical, unchanging plain text versions are readily established, and
> there is much interest in people marking it up in different ways for
> different purposes. In your app you could view the text and import/overlay
> any number of scholars' markup and annotations. Or you could process
> different exo-markup files to datamine for correlations etc.

  I'll go off in another direction. I often work with text files made up of
ASCII characters. There are 128 ASCII characters (or perhaps 256, see
http://www.asciitable.com/), but not all of them are "printable": some show
up on the computer screen as " ", i.e. a blank space, even though they aren't
a blank/space character in storage. To distinguish between these non-printing
characters and a true space character one needs to peer inside the computer
storage (the Unix/Linux app "od" will do that). The most common reason for
doing this is that Microsoft Word loves to use characters outside the
standard 128 ones, and this can lead to unanticipated results when putting
material on the web or doing computer processing.
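
For those who would rather not read an "od" dump, a rough Python equivalent
is sketched below. It is an illustration only: the function name is
hypothetical, the sample string is made up, and treating tab, line feed and
carriage return as ordinary whitespace is an assumption.

```python
import unicodedata

def suspicious_characters(text):
    """List (offset, code point, Unicode name) for characters outside printable ASCII."""
    found = []
    for pos, ch in enumerate(text):
        if ch in "\t\n\r" or 32 <= ord(ch) <= 126:
            continue                       # ordinary whitespace or printable ASCII
        name = unicodedata.name(ch, "<unnamed>")   # control characters have no name
        found.append((pos, f"U+{ord(ch):04X}", name))
    return found

# A no-break space and curly quotes, of the kind a word processor may insert.
sample = "plain space, no\u00a0break space, \u201ccurly quotes\u201d"
for hit in suspicious_characters(sample):
    print(hit)
```

Each reported offset points at a character that may look like a space or a
plain quote on screen but is stored as something else.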

--henry schaffer

--[2]------------------------------------------------------------------------
        Date: 2019-02-04 12:55:41+00:00
        From: Gabriel Egan 
        Subject: Re: [Humanist] 32.417: the McGann-Renear debate

Dear HUMANISTs

In response to my querying ("show us an example") of
Desmond Schmidt's claim that in early modern drama it's
possible not only for a dialogue line to be inside a
speech but also for a speech to be inside a line,
Herbert Wender offers an example from Goethe's 'Faust'.
The example is of a manuscript in which the name
"ROSENKNOSPEN" appears anomalously in the middle
of a spoken line. In an article cited by Wender,
the emendation of this line is discussed, and two
options considered. One is to move the name to the
beginning of the line so it forms a speech prefix,
and the other is to move it to another line altogether.

I'm not seeing how this example illustrates what Schmidt
claimed, which was that it's permissible in early modern
drama for a speech to be inside a line. Rather, it seems
to illustrate that it's possible for a textual witness to
contain error. Far from treating this moment in the play as
an example of a speech being inside a line, the editions of
'Faust' under discussion in the essay Wender cites treat
this as an error to be corrected.

There can of course be a speech prefix (marking the
end of one speech and the start of another) occurring
within a manuscript or printed line. Where these occur,
no early modern dramatist thought that the speech was
inside the line. We know they didn't because when they
came to make the actors' 'parts', each of which contained
all the lines to be spoken by a single character, they
would divide such a manuscript or printed line between
two different 'parts' (different physical documents), one
for each of the two characters. That is, they saw such a
manuscript or printed line of words as really containing
two lines: first the last line of one person's speech and
secondly the first line of someone else's speech. They
treated such a shared manuscript or type line just as we
do today, as being really two lines crammed together.

The context for all this is that I was defending the
claim that texts such as early modern plays really
are an Orderly Hierarchy of Content Objects. The
tree-ness is not merely in the eye of the beholder
as Schmidt claimed.

Regards

Gabriel Egan




--[3]------------------------------------------------------------------------
        Date: 2019-02-04 08:11:14+00:00
        From: Desmond Schmidt 
        Subject: Re: [Humanist] 32.416: the McGann-Renear debate

Gabriel,

I haven't counted them but there are probably thousands of such cases
in Shakespeare. An example from Hamlet Act 1, Scene 1:

Horatio: Friends to this ground.
Marcellus: And liegemen to the Dane.

(the pentameter line being split over two speeches)
This example was given by Barnard et al. way back in 1988 (LLC 3.1,
26–31). In addition, any speech that ends in a half-line followed by
one starting in another half-line provides an example of overlap.
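
To make the containment question concrete, here is a hedged sketch that
records the two speeches and the shared verse line as half-open character
spans and classifies how any two of them relate. The `relation` function and
the variable names are assumptions made for illustration.

```python
def relation(a, b):
    """Classify two half-open (start, end) spans."""
    (a0, a1), (b0, b1) = a, b
    if a1 <= b0 or b1 <= a0:
        return "disjoint"
    if a0 <= b0 and b1 <= a1:
        return "a contains b"
    if b0 <= a0 and a1 <= b1:
        return "b contains a"
    return "overlap"

first = "Friends to this ground."
second = "And liegemen to the Dane."
text = first + " " + second

horatio = (0, len(first))                 # Horatio's speech
marcellus = (len(first) + 1, len(text))   # Marcellus's speech
verse_line = (0, len(text))               # the single pentameter line

print(relation(verse_line, horatio))      # a contains b
print(relation(verse_line, marcellus))    # a contains b
print(relation(horatio, marcellus))       # disjoint
```

Here the verse line contains both speeches, the reverse of the usual case
where a speech contains its lines. A speech that began on an earlier line
and ended at this half-line would be classified as "overlap" with the shared
line.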

Sure, Shakespeare had in his mind that his plays were composed of 5
acts, each consisting of a number of scenes etc, but when they were
printed they were rendered using lead characters arranged on a
rectangular grid using fonts of different sizes and types without the
need for any explicit hierarchies. And when he wrote them he was not
constrained as we are to create elements that strictly nest. He did
whatever he could with his pen, not his computer.

I chose the Shakespeare example deliberately because you CAN analyse
it with a strong hierarchical structure that almost works. But all you
are doing by insisting on it, is setting up your hierarchy to be
broken by someone else who wants a different analysis. And what do the
hierarchies actually achieve? They don't tell you anything you didn't
already know, and so can be dispensed with. They are just a
requirement of the markup language.

Hugh,

I did not say that semantic markup should be external to the text. I
said that semantic information can be derived from text without using
any kind of markup. Semantic markup in XML is too often focused on the
narrow needs of the people who encoded it, or merely records things
that are self-evident and hence not useful for general search and
retrieval. I was advocating instead the use of concept-mining tools
like Leximancer that can extract meaning from plain text, HTML and the
like. Also, if modern machine learning techniques can translate from
Chinese to English fluently they can also extract meaning from text.
So marking up small amounts of meaning internally or externally to a
text doesn't seem worth the effort to me. I am advocating a much
simpler format for text close to plain text that can be easily mined
for information, that contains only rendering or abstract rendering
information. Deeply structured texts as once provided by XML don't fit
the bill because they mix up the rendering with the semantics and use
too rigid a document structure that invites overlapping hierarchies on
reuse.

On 2/3/19, Humanist  wrote:
>                   Humanist Discussion Group, Vol. 32, No. 416.
>             Department of Digital Humanities, King's College London
>                    Hosted by King's Digital Lab
>                        www.dhhumanist.org
>                 Submit to: humanist@dhhumanist.org
>
>
>     [1]    From: Hugh Cayless 
>            Subject: Re: [Humanist] 32.412: the McGann-Renear debate (56)
>
>     [2]    From: Gabriel Egan 
>            Subject: Re: [Humanist] 32.412: the McGann-Renear debate (34)
>
>
> --[1]------------------------------------------------------------------------
>         Date: 2019-02-02 17:46:09+00:00
>         From: Hugh Cayless 
>         Subject: Re: [Humanist] 32.412: the McGann-Renear debate
>
> This has been one of those threads where I'm torn between responding and
> unsubscribing.
>
> Desmond argues (if I am understanding correctly) that since semantic markup
> cannot perfectly describe what's going on in a text, it's better not to do
> it in the text, and instead to focus on a minimalist production, with only
> the necessary features, which can then have different layers of annotation
> wrapped around it. I suspect Alex might agree with this approach. In this
> view, TEI/XML is fundamentally flawed because it imposes structures on the
> text that aren't really there, or are only there in certain interpretations
> or readings of the text.
>
> His interlocutors are arguing that this argument confuses format with
> function and that nothing stops you from doing a minimal TEI with
> annotations, or deriving a minimal text from a maximally marked up text.
> TEI is not fundamentally flawed because though it can't do everything, it
> can be a foundation for doing practically anything. It just gives you a
> language for making text models; what you say in that language is up to
> you. I suspect Desmond might counter that language influences cognition,
> and an imperfect language may steer you to think in ways that are actually
> pernicious.
>
> Alex's point about the "quantum" nature of text is well taken, though I
> think perhaps it points more at the character encoding level than the
> markup level. In the former, in order to represent his example, I have to
> decide whether the thing in question is a circle, or a Latin o, or perhaps
> an omicron or Cyrillic o, or something else entirely. In fact, at the
> markup level (in TEI at any rate), there are ways to represent this kind
> of uncertainty.
>
> But this leads us towards a problem: I've often heard the argument from
> folks who do things like machine text analysis that TEI is too messy a
> format for them. And indeed it often is. You can't just derive a token
> stream from many TEI documents without first making informed decisions
> about how to get at that stream -- normalized or original text? Base text
> or particular readings? But here is, I think, where the crux lies: TEI
> says, if you will, "this whole digital artifact is the edition, my (the
> editor/encoder)'s argument about the text." The annotators say, "here is
> the text, and there are my arguments about it. You can easily have one
> without the other."
>
> But can you separate the argument about the text from the text itself? My
> own answer to this question is a resounding NO. But maybe that comes from
> my perspective as someone who works a lot with messy and difficult edge
> case texts. Likely in many cases it doesn't really matter. In my view the
> splitting of text from argument (or even the idea that you should) pushes
> you towards error in the same sorts of ways Desmond believes hierarchy
> pushes you towards error. Who's right? I dunno.
>
> Perhaps we just have to be aware that our tools and formats all have their
> benefits and risks and we have to make decisions in the light of that
> awareness. My plea would be for more open collaboration and constructive
> criticism and less "You're doing it wrong!"
>
> All the best,
> Hugh
>
> PS I really like Herbert's suggestion of a TEI Guidelines "Dirty Tricks"
> chapter.
>
> --[2]------------------------------------------------------------------------
>         Date: 2019-02-02 10:12:06+00:00
>         From: Gabriel Egan 
>         Subject: Re: [Humanist] 32.412: the McGann-Renear debate
>
> Dear HUMANISTs
>
> I asserted that in early modern drama, "All the
> dialogue lines occur inside speeches, all the
> speeches occur inside scenes, and all the scenes
> occur inside acts, and there are exactly five acts".
> Desmond Schmidt responded:
>
>  > Well, that's your analysis. Another way to analyse
>  > it is to say that the headings for scenes and acts
>  > are simply in italics or a big font.
>
> But as I pointed out, the creators of early modern drama
> repeatedly described their work the way I've described
> it, as a tree, and never once (in the materials I'm
> aware of) described it in terms of the typographical
> representations of the units' headings. Are you
> saying that it doesn't matter how the creators
> thought of their work?
>
>  > In any case your example is not perfect: sometimes
>  > speeches are inside lines and sometimes lines are
>  > inside speeches. How do you explain that?
>
> I believe the claim that "sometimes speeches are
> inside lines" to be untrue. Can you give us an
> example?
>
> Regards
>
> Gabriel Egan


--
Dr Desmond Schmidt
Mobile: 0481915868 Work: +61-7-31384036






Editor: Willard McCarty (King's College London, U.K.; Western Sydney University, Australia)
Software designer: Malgosia Askanas (Mind-Crafts)

This site is maintained under a service level agreement by King's Digital Lab.