Humanist Discussion Group, Vol. 32, No. 452.
Department of Digital Humanities, King's College London
Hosted by King's Digital Lab
www.dhhumanist.org
Submit to: email@example.com

From: C. M. Sperberg-McQueen
Subject: some notes on the origins of SGML and XML (128)

From: Peter Robinson
Subject: Re: [Humanist] 32.451: the McGann-Renear debate (69)

From: Gabriel Egan
Subject: Re: [Humanist] 32.451: the McGann-Renear debate (71)

From: firstname.lastname@example.org
Subject: Re: [Humanist] 32.451: the McGann-Renear debate (44)

From: Michael Falk
Subject: Re: [Humanist] 32.451: the McGann-Renear debate (80)

--------------------------------------------------------------------------
Date: 2019-02-13 21:50:06+00:00
From: C. M. Sperberg-McQueen
Subject: some notes on the origins of SGML and XML

In Humanist 32.435, Martin Mueller writes:

> The TEI rules were first expressed in SGML, a technology developed by IBM.

This is at least half true. The TEI was indeed first written as an SGML application; a team at IBM did develop a system called GML (generalized markup language); and SGML was in some sense a standardized form of GML. But what GML contributed to SGML is probably better described as an approach or a philosophy of text representation than as 'technology'.

The ideas of generic markup were also being developed by others outside IBM: the book designer Stanley Rice; Bill Tunnicliffe of the Graphic Communications Association (a printing-industry trade group) and what became the GCA 'GenCode' (generic coding) committee; Brian Reid (whose 1980 dissertation described a document processing system called Scribe, which used generic markup and later became a commercial product and an inspiration for Leslie Lamport's LaTeX). Some of the most important ideas in SGML were not present in GML, and GML usage looks relatively little like the technological ecosystem that eventually developed around SGML and XML.

In Humanist 32.436, Desmond Schmidt writes:

> XML was invented by IBM and Microsoft, through the organ of the W3C, to serve the needs of web services. Document processing was very much a sideline.

This is nowhere close to half true.
All the members of the Working Group and 'Editorial Review Board' responsible for the initial development of XML were document people: Jon Bosak, Dave Hollander, Eliot Kimber, and Eve Maler had backgrounds in technical documentation; Tim Bray, James Clark, Steve DeRose, Tom Magliery, Jean Paoli, and Peter Sharpe had all developed major document editing, document processing, and document publication applications, and I had spent eight years as editor in chief of the Text Encoding Initiative.

Some of us were convinced that SGML, being a system for the representation of structured information, with user control over what should be represented and how, had great potential for information of all kinds, including the kinds of client-server applications that were later developed under the rubric "web services", but the charter of the WG was "SGML on the Web", and the members of the WG and the ERB all used SGML first and foremost (or exclusively) as a language for documents. Some ERB members like Tim Bray and Tom Magliery were perhaps interested in the Web first and in SGML as a way to develop and improve the Web; the rest of us were by all appearances more interested in our documents. Our goal for the Web was only to make it capable of delivering our documents without unwanted loss of information.

After the initial draft was published, during the further development of the spec, the WG continued to grow. It is possible that some of the new members were interested primarily in what later became known as web services, but I don't recall any discussion of issues relevant to that interest.

After the XML spec was finished, database vendors and those interested in what became web services took an interest because the notation could in fact be used (as some of us had thought) for non-document information as well, and was superior to the alternatives then on hand.
It is possible that Desmond Schmidt and others have mistaken the promotional literature and hype of the late 1990s and early 2000s for serious documentation on the origins of XML. Database vendors, web-services enthusiasts, and programming-language type theorists were all involved in the development of some of the later XML-related technologies like XSD and XQuery, and the tensions between 'data heads' and 'document heads' were palpable in some working groups.

Desmond Schmidt also writes:

> Humanists must follow business.

I'm not convinced this is true. It's certainly convenient when we can use off-the-shelf hardware and software, but many of the milestones of computer usage in the humanities involve humanists who developed their own software (and sometimes hardware) when they judged commercial offerings unsuitable. I will mention only the names Susan Hockey, Claus Huitfeldt, Wilhelm Ott, David Packard, and Manfred Thaller.

As Jon Bosak pointed out in a talk at the TEI@10 conference in 1998, one reason for digital humanists to care about XML is that even if commercial support and development were to disappear, the format is simple enough that we can write our own software to process it and need not rely on commercial vendors.

Nor is the opposition between the needs of humanists and the needs of commercial applications nearly so clear-cut as the quoted sentence suggests: there is no problem in humanistic work with texts that does not have an analogue in commercial or bureaucratic applications, and vice versa. And some technologies (e.g. Unicode and XML) were developed by representatives of commercial interests and humanists working together, attempting to provide results useful to multiple user communities.

Desmond Schmidt is quite right to say that XML is no longer as fashionable in non-document quarters as it once was.
Those interested in web services have discovered that it has many features, like validation and mixed content, which are of interest to people interested in texts and which they regard as not relevant for themselves. It has these features precisely because it was developed for documents by people who worked with documents, and not (pace DS) for web services.

Those who choose their computing tools based on their current vogue will naturally also choose to migrate away from XML. Those who care about user control of data and suitability of tools for tasks should make choices based on careful examination of the relevant tools and not based on the winds of current fashion.

********************************************
C. M. Sperberg-McQueen
Black Mesa Technologies LLC
email@example.com
http://www.blackmesatech.com
********************************************

--------------------------------------------------------------------------
Date: 2019-02-13 16:50:14+00:00
From: Peter Robinson
Subject: Re: [Humanist] 32.451: the McGann-Renear debate

As Donald McKenzie tells us, the history of reading is the history of misreadings (or something like that). By a “text-complete theory” I don’t mean that every text is completely described by this theory. Quite the reverse. Just that every text can be represented according to this theory. That is: that every text may be represented as a collection of leaves all present on two unrelated trees (or OHCOs). This is minimal only, if you like: and any kind of work is likely to add far more to that basic representation.

So, let’s push this a little further. The basic propositions again, shared by every text which has existed, does exist, can exist (hence, “text-complete”):

1. All texts are real, in that each and every text is an act of communication present in a physical document
2. Therefore, every text has at least two aspects: it is an act of communication; it has physical properties in terms of the document in which it is present
3. Each aspect may be represented as an OHCO: an ordered hierarchy of content objects, a tree
4. The two trees are entirely independent of each other, and of any other tree hypothesized as present in the text

Your first challenge. If this is not correct, you should be able to put up lots of examples (one will do!) of texts of which these propositions are not true. Go ahead. Knock yourself out.

(A side note: axiomatically, one document may contain, and usually does contain, multiple acts of communication: hence, multiple texts. Equally, an act of communication may appear in many documents, or many times in one document; acts of communication may be related to each other, as versions or revisions, and may again appear in one document, many documents, many times in one document.)

A few more notes. As both act of communication and document may be represented by tree structures, we can use all the well-known tools of tree structures for each aspect of the text. In every text there is, we can create tables of contents of both the document, page by page etc, and of the act of communication, by act, scene, speaker/line. It means we can navigate the trees for each aspect, from line to line, page to page, act to act, act to scene to line, up, down and across. Useful. Every text, two tables of contents.

Further: as each leaf of text exists on both trees, it makes sense, for every text that ever was, is, shall be, to ask these questions:

What documents does this act of communication exist in? (what manuscripts contain the Gospel of John?)
What acts of communication does the document contain? (what parts of the Bible are present in Codex Sinaiticus?)
What parts of what acts of communication are on this page, in this column, in this writing space?
What pages of what documents contain this part of this act of communication? (what pages of what mss have John 1.1?)

Again, all you have to do is find a text for which these are not meaningful questions, and the proposition is disproved.
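
[Editorial illustration, not part of the original message.] The questions above can be sketched in a few lines of Python, on the assumption that each leaf of text simply records its position in both trees as a path. The `Leaf` class, the helper functions, the manuscript labels "MS-A" and "MS-B" and the sample leaves are all hypothetical, invented for this sketch; they do not describe the ontology or API mentioned in this message.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Leaf:
    """One leaf of text, addressed independently on two trees."""
    text: str
    doc_path: tuple   # path in the document tree, e.g. (manuscript, page)
    comm_path: tuple  # path in the act-of-communication tree, e.g. (book, chapter, verse)


# Hypothetical sample data: labels and placements are invented for illustration.
LEAVES = [
    Leaf("leaf text 1", ("MS-A", "page-1"), ("John", "1", "1")),
    Leaf("leaf text 2", ("MS-A", "page-1"), ("John", "1", "2")),
    Leaf("leaf text 3", ("MS-B", "page-7"), ("John", "1", "1")),
    Leaf("leaf text 4", ("MS-B", "page-2"), ("Matthew", "1", "1")),
]


def starts_with(path, prefix):
    """True if `path` lies at or below the tree node addressed by `prefix`."""
    return path[:len(prefix)] == prefix


def documents_containing(leaves, comm_prefix):
    """Which document locations hold this part of an act of communication?"""
    return sorted({leaf.doc_path for leaf in leaves
                   if starts_with(leaf.comm_path, comm_prefix)})


def communications_in(leaves, doc_prefix):
    """Which parts of which acts of communication are in this document (or page)?"""
    return sorted({leaf.comm_path for leaf in leaves
                   if starts_with(leaf.doc_path, doc_prefix)})


if __name__ == "__main__":
    # What pages of what manuscripts have John 1.1?
    print(documents_containing(LEAVES, ("John", "1", "1")))
    # What acts of communication does MS-B contain?
    print(communications_in(LEAVES, ("MS-B",)))
```

Because every leaf carries both paths, either tree can be navigated or queried without reference to the other, which is one simple way to model the independence claimed in proposition 4.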
I don’t say, again, that this is all there is to texts. Not at all. Or that there are not other phenomena in texts which are not OHCO trees, etc etc. Or that one might indeed dismember and re-member a text in entirely different ways.

By the way: this also answers the problem posed by Michael Falk: you encode instances of the act of communication in each document you encounter (e.g. ); you locate each instance in every document and compare them. Add more documents, with the same encoding for the act of communication, and so on. And use CollateX for the comparison if you want really useful results.

More reading again: https://wiki.usask.ca/pages/viewpage.action?pageId=1306492976 shows how we implement this via an ontology into an API. We call every part of an act of communication an “entity”, with the act of communication itself being a single entity.

--------------------------------------------------------------------------
Date: 2019-02-13 13:56:16+00:00
From: Gabriel Egan
Subject: Re: [Humanist] 32.451: the McGann-Renear debate

Dear HUMANISTs

Michael Falk gives an eloquent account of the difficulty of recording in a single XML document the changes made to a poem during its revision by its author. To Falk, the alteration seems to sprawl across XML units (in this case, lines) so that to record it the editor must do one of the following: i) break the XML principle of nestedness, or ii) preserve nestedness by treating whole lines as the subjects of revision when in fact only parts of lines were revised, or iii) preserve nestedness by recording a single revision as multiple revisions, each occurring wholly within its own line.

It seems to me that in Falk's example the editor might be trying to record more than he knows. He writes that in one revision "the word 'shores' has been moved yet again to another element". What does it mean, exactly, for a writer to move a word?
One possibility is that the writer has crossed out "shores" in one place and written it again in another. A second possibility is that the writer has inserted matter before "shores" that closes off the element (say, a line) that "shores" used to be within, making the same inscription of "shores" now appear inside a different line element.

In either case, it is not indisputable that the word "shores" before the revision is the same as the word "shores" after the revision. Falk writes that the poet has "moved" the word "shores" but it is equally reasonable to say that the poet has deleted an occurrence of "shores" and made a new one, especially if a new inscription of the letters "s-h-o-r-e" has occurred. Looking at the same phenomenon this way, the revision did not necessarily break the hierarchical nestedness. That is, I'm not clear what would count in this instance as evidence that the author made a "single act of revision" rather than two acts.

Indeed, what makes "shores" the unit that was "moved"? We might say that poets think in terms of words, so words are the natural units by which to describe revisions. But it is at least equally arguable that poets think in terms of lines or that they think in terms of phonemes or even individual letters. If a poet deletes "the" in line one and turns "me" in line two into "theme", is this two changes or a single one in which the letters "t-h-e" were moved from line one to line two? It seems to me that the answer to that question is non-obvious and requires us to state more explicitly what count as the units of revision. There really is more than one way to describe the difference between two versions of something.

Is not the problem Falk is describing a conflict between a hierarchy he has implicitly (unconsciously?) imposed on the text in order to describe its revision and the hierarchy he inherits from the XML encoding scheme he has selected?
If so, that is not in itself a criticism of XML's principle of a hierarchical nestedness. It seems to me that here again XML's insistence on hierarchical nestedness forces us to think more clearly about what we are doing in editing texts and diagnosing revision.

Regards

Gabriel Egan

--------------------------------------------------------------------------
Date: 2019-02-13 10:26:47+00:00
From: firstname.lastname@example.org
Subject: Re: [Humanist] 32.451: the McGann-Renear debate

On 2/13/19 Hugh Cayless wrote:
> Michael Falk's contribution seems to me to exemplify many of the kinds of
> error we see in these sorts of discussions
>
> > May I just reiterate the point Bill Pascoe made a few emails ago. It is not
> > the case that: "3. Each aspect may be represented as a OHCO: an ordered
> > hierarchy of content objects, a tree." This is the central weakness of XML
> > as a universal markup language. It insists on an impossibly strict nesting
> > of elements.
>
> I'm afraid if you're going to make assertions about impossibility, you will
> have a very hard time proving them.

It is quite possible for a TEI encoding of holograph manuscripts to be so complex that it is practically, although not literally, impossible to edit. That is, it is just as likely to be damaged as improved by any attempt to edit it. If it is shared by a group of editors this level of complexity is reached much sooner. The problem then becomes: how do you communicate your understanding of the "howling wind-storm" of tags that results to your colleagues so they may share your interpretation of the textual phenomena being described?

Here is a moderately difficult example. A succession of hired transcribers simply refused to encode this for us. I wonder how hierarchies help us here?

http://charles-harpur.org/corpix/english/harpur/A87-2/00000131a.jpg

Breaking it down into separate layers as we have done is close to the method Michael describes, and renders the editorial task perfectly manageable.
http://charles-harpur.org/View/Twinview/?docid=english/harpur/poems/h509&version1=/h509b/layer-final

Desmond Schmidt
eResearch
Queensland University of Technology

--------------------------------------------------------------------------
Date: 2019-02-13 07:30:03+00:00
From: Michael Falk
Subject: Re: [Humanist] 32.451: the McGann-Renear debate

I appreciate Peter Cayless's response. I overstated my case. I'm not against TEI XML in general. It's a very useful way of encoding the structure of a document, and has some useful metadata standards that I think anyone who's into distant reading appreciates. Nonetheless, a few points:

(1) My aim was simply to show that there are common cases in textual editing where the strict nesting of XML elements causes inconvenience. The poet I drew my example from is an extreme case, where virtually every poem exists in multiple manuscripts, and virtually every manuscript has multiple layers of revision. In other cases XML works very well. Some of the problems with my particular example could be fixed if the standard were changed. For instance, pagination creates just the kind of frustrating overlapping of elements that lineation does in my example. I can't access the TEI documentation at the minute, but as I recall, this problem has been overcome by representing page breaks as self-closing elements. This would, however, be quite difficult to do for text elements like lines.

(2) Peter makes the same suggestion that different "works" should be treated as different texts and be encoded in different XML documents. This is more or less exactly what I was suggesting. I must say I do not agree that my nonce encoding involved "mashing" three different works together, though he would be perfectly entitled to make that decision in his own edition. It entirely depends what you mean by "work," a word whose definition will always be contested.

(3) As for my faith in algorithms, I must declare an interest.
The system my colleague Desmond Schmidt has designed for the Charles Harpur Critical Archive appears to work quite well. But I might reframe my point in the light of James Rovira's question and Peter's critique. I do not want to claim that algorithms are the solution where XML is not. It was simply an example of an alternative approach. A database of versions and revisions would be another. These approaches all have their advantages and disadvantages, of course. A database requires a whole software ecosystem to run, where a bunch of XML files are pretty resilient to changing software standards and virtually self-describing. Graph representations like Desmond's allow rapid collation and efficient storage, but are not human readable. As Peter quite rightly says, it all depends on your purpose.

(4) So far I agree with Peter and simply wish to refine my argument. But there is one point where we disagree. I don't think it is true that representations are equivalent if they can be transformed into each other without loss of information. Perhaps I misunderstand Peter's point, but this seems to overlook information entropy. It must be obvious that some representations are more efficient than others, and encode the same information in fewer bits. Otherwise it would be "paragraph" not "p", and "line-group", not "lg." But there is also the more important point in practice, that some representations are more laborious to make than others. In many common cases of textual editing, XML is both more laborious than the alternatives, and it would not surprise me if it also required many more bits for the same information (though I could well be wrong there). In other cases, like preparing a well-structured reading text to be rendered on a variety of devices in different ways, it is surely the ideal technology. There is also the simple matter of elegance, which matters because it goes to the interpretability of a representation by a human.
I think these points are important, because I have seen projects (I will not name them) that have opted for XML early on, and involved themselves in excruciating labours down the line that possibly could have been avoided. To me it seems that the tree-structure of XML is usually the issue, so that is what I criticised.

I don't mean to rant against it. I use TEI XML all the time in my own work. Having the metadata in a standard format, good abstractions of common text types, and a range of supportive technologies like XPath and XSLT make it a joy to use in many applications. I intend only those types of criticism that Peter welcomes.

Michael Falk

--
Michael Falk
Developer and Research Project Manager
Digital Humanities Research Group
Western Sydney University
Living and Writing in the Blue Mountains
https://www.michaelfalk.com.au

Sent from my phone

_______________________________________________
Unsubscribe at: http://dhhumanist.org/Restricted
List posts to: email@example.com
List info and archives at: http://dhhumanist.org
Listmember interface at: http://dhhumanist.org/Restricted/
Subscribe at: http://dhhumanist.org/membership_form.php
Editor: Willard McCarty (King's College London, U.K.; Western Sydney University, Australia)
Software designer: Malgosia Askanas (Mind-Crafts)
This site is maintained under a service level agreement by King's Digital Lab.