Humanist Discussion Group

Humanist Archives: June 19, 2020, 10:21 a.m. Humanist 34.123 - annotating notation

                  Humanist Discussion Group, Vol. 34, No. 123.
            Department of Digital Humanities, King's College London
                   Hosted by King's Digital Lab
                Submit to: humanist@dhhumanist.org

    [1]    From: C. M. Sperberg-McQueen 
           Subject: Re: [Humanist] 34.121: annotating notation (44)

    [2]    From: Desmond  Schmidt 
           Subject: Re: [Humanist] 34.121: annotating notation (47)

    [3]    From: Dr. Herbert Wender 
           Subject: Re: [Humanist] 34.121: annotating notation (66)

    [4]    From: Hugh Cayless 
           Subject: Re: [Humanist] 34.117: annotating notation (60)

    [5]    From: Iian Neill 
           Subject: Re: [Humanist] 34.121: annotating notation (23)

        Date: 2020-06-19 04:40:21+00:00
        From: C. M. Sperberg-McQueen 
        Subject: Re: [Humanist] 34.121: annotating notation

In Humanist 34.121, Desmond Schmidt offers a useful concise summary of
the document model described by Peter Robinson in 34.117 and asks for
corrections if he has missed any essential details.

I would suggest only one addition: it is an essential point of PR’s
model that while each tree is ordered, leaves shared among trees need
not have the same ordering in those trees; DS’s suggestion that the
shared leaf nodes form a spine might otherwise allow some readers to
suppose that the trees will always order leaves the same way.
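
A minimal sketch of this point, assuming a hypothetical page on which a marginal addition is written apart from the main text: two ordered trees share the same three leaves but visit them in different orders. The class, the leaf ids and the sample content are illustrative assumptions, not taken from Textual Communities.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """An internal node of one tree; children are kept in that tree's own order."""
    label: str
    children: list = field(default_factory=list)  # Node objects or leaf ids (str)


def leaf_order(node):
    """Return the leaf ids in the order this tree visits them (depth-first)."""
    order = []
    for child in node.children:
        if isinstance(child, Node):
            order.extend(leaf_order(child))
        else:
            order.append(child)
    return order


# Shared pool of text leaves, keyed by hypothetical ids.
leaves = {
    "t1": "main text, first part",
    "t2": "marginal addition",
    "t3": "main text, second part",
}

# Document tree: follows the physical page, so the margin comes after the column.
document_tree = Node("page", [Node("main-column", ["t1", "t3"]),
                              Node("margin", ["t2"])])

# Work tree: reads the addition at its intended point of insertion.
work_tree = Node("paragraph", ["t1", "t2", "t3"])

document_order = leaf_order(document_tree)
work_order = leaf_order(work_tree)
print([leaves[i] for i in document_order])
print([leaves[i] for i in work_order])
```

Both trees cover exactly the same leaves; only the sequence differs, which is why a single fixed spine would not be enough to describe them.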

Many people have wished not to hear any more comparisons of JSON and
XML (a slightly odd wish, perhaps, in a thread that began with PR
making such a comparison, diverting an earlier discussion of computer
architectures in order to do so), so I will not detain readers of this
list with references to the literature in which similar or contrasting
proposals for multiple hierarchies have been made.  It would be
remiss, however, not to note the similarities between PR’s Textual
Communities model and the models of XStandoff (first described under
the name Sekimo Generic Format, developed by Maik Stührenberg and
others in Bielefeld as part of the Sekimo project) and MulTiX
(developed by Sylvia Calabretto and others in Lyon).  If anyone doubts
Desmond Schmidt’s observation that the Textual Communities model and
similar models can be implemented using XML and XML tools, those
projects demonstrate that he is correct.

Anyone interested in the history of discussions of how to handle
complex structures, multiple hierarchies, or non-hierarchical
structures in or out of XML can do worse than consult the proceedings
of Balisage [1], in which the topic is well represented.


PR’s description — and even more so DS’s terse summary — has the
distinction of being shorter and easier to follow than some of the
other descriptions.  That such similar models have been developed
independently by different groups may suggest a certain generality and

C. M. Sperberg-McQueen
Black Mesa Technologies LLC

        Date: 2020-06-18 21:23:24+00:00
        From: Desmond  Schmidt 
        Subject: Re: [Humanist] 34.121: annotating notation

Dear all,

I'd just like to add a few observations to what I said earlier about
the TC model. There are only a small number of possible solutions to
the overlapping hierarchies problem - and by that I mean "hierarchies
that overlap", not "overlap" in general. What TC apparently does is to
atomise a document into its constituent parts - text and nodes - and
store them in a database. The alternative would be to store whole XML
or JSON documents, which wouldn't achieve very much. You'd still have
an XML document with all its limitations. Converting it to JSON won't
help either. JSON is great as a web services communication format, but
is awful at representing the kinds of documents that Humanists want to
make out of original sources. The approach in TC is to try to leverage
the database's own querying language and its implementation, which in
MongoDB's case is JavaScript rather than SQL, to try to answer
questions like "in which quire or column is this line of poetry?"
Although it does this successfully, the drawback seems to be that all
the constituent bits of my documents are also mixed in with everyone
else's. MongoDB has the concept of collections (which in the
relational model are tables), but it doesn't have a finer subdivision
of "the documents of Fred" or even "The Merchant of Venice".
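
As a rough sketch of the atomised approach, assuming an in-memory list of records in place of a real database collection: each record is one atom (a text leaf or a structural node) that points to its parent in each tree, and the quire-or-column question is answered by walking upward. The field names, ids and sample content are hypothetical and do not reproduce TC's actual schema.

```python
# Each record is one "atom". "parents" maps a tree name to the parent's id.
records = [
    {"id": "q1", "kind": "quire", "n": "1", "parents": {}},
    {"id": "c1", "kind": "column", "n": "a", "parents": {"document": "q1"}},
    {"id": "l12", "kind": "line", "n": "12", "parents": {}},
    {"id": "t1", "kind": "text", "content": "placeholder text of the line",
     "parents": {"document": "c1", "work": "l12"}},
]
by_id = {r["id"]: r for r in records}


def containers(leaf_id, tree):
    """Walk upward from a leaf through one tree, collecting (kind, n) pairs."""
    found = []
    current = by_id[leaf_id]["parents"].get(tree)
    while current is not None:
        node = by_id[current]
        found.append((node["kind"], node["n"]))
        current = node["parents"].get(tree)
    return found


def where_is_line(line_n):
    """Find the document-tree containers of every text leaf in a given line."""
    line = next(r for r in records if r["kind"] == "line" and r["n"] == line_n)
    answers = set()
    for r in records:
        if r["kind"] == "text" and r["parents"].get("work") == line["id"]:
            answers.update(containers(r["id"], "document"))
    return sorted(answers)


location = where_is_line("12")
print(location)
```

Note that nothing in these records says which document or owner an atom belongs to; that would need an extra field on every record, which is the subdivision problem described above.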

What would I do in his place? I'd get rid of the atomisation of the
documents but also the two trees, and represent the former XML
elements as two sets of flat properties. That is, named ranges that
point to the same fragments of text as did the original XML or TC
nodes. The advantage here is that you can represent not only
overlapping hierarchies but also plain overlap - say a case of two
kinds of underlining. And although it would be convenient to keep the
document and act-of-communication sets of properties separate, you
could also combine them because it wouldn't matter if they overlapped.
So instead of one node per document in the database you would have one
SET of nodes representing each view of the text. This is what we do in
Ecdosis (if I may make a free plug) but it does show that it works in
practice. You can then convert the properties and text into HTML,
although that does require a custom tool. But you're writing one
anyway - what else is TC? I would suggest therefore a similar approach
to him also. It's not an original idea - LMNL for example does it that
way too - but it is a good and proven approach.
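
A minimal sketch of the idea, assuming properties are stored as (name, start, end) triples with an exclusive end offset: the text is cut at every range boundary, and each segment is wrapped in a span carrying the names of all properties active over it. The property names and the function are illustrative assumptions, not the format or tool used in Ecdosis.

```python
from html import escape

text = "The quality of mercy"

# Two overlapping kinds of underlining, as flat named ranges over the text.
properties = [
    ("underline-red", 0, 11),   # "The quality"
    ("underline-blue", 4, 20),  # "quality of mercy"
]


def to_html(text, properties):
    """Render flat, possibly overlapping ranges as non-overlapping HTML spans."""
    # Every range boundary is a potential cut point in the output.
    cuts = sorted({0, len(text)}
                  | {start for _, start, _ in properties}
                  | {end for _, _, end in properties})
    out = []
    for start, end in zip(cuts, cuts[1:]):
        # A property is active on a segment if its range covers the segment.
        active = [name for name, s, e in properties if s <= start and end <= e]
        segment = escape(text[start:end])
        if active:
            segment = '<span class="{}">{}</span>'.format(" ".join(active), segment)
        out.append(segment)
    return "".join(out)


html_out = to_html(text, properties)
print(html_out)
```

Because the ranges are never nested inside one another in storage, overlap costs nothing until rendering, where it is resolved by splitting the text into segments.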

I hope this doesn't sound too technical, but as Michael says, digital
humanities is really a fusion of technical and humanistic

Desmond Schmidt

        Date: 2020-06-18 19:27:04+00:00
        From: Dr. Herbert Wender 
        Subject: Re: [Humanist] 34.121: annotating notation

First observation, looking at the last communications in this thread: there is a
certain caution about speaking of hierarchies when speaking of trees as possible
(alternative) structural descriptions. This seems important insofar as it is
really unclear what is meant by 'hierarchy', or how it should be defined in
textual/digital scholarship. To realize the problem I recommend comparing
sentence structures visualized as trees in alternative approaches to sentence

Second observation: Naturally there is a strong association between working
areas and preferences for certain ways of focusing on textual communication: the
quire level in TC's document tree shows a strong affinity to book science
(bibliography) or, more broadly, to the material side of the medium in
literary communication (including the whole chain from authors via publishers
and printing houses to the public). If we compare a classical TEI encoding in the
OTA, "Lord Jim" by Joseph Conrad, we will see great negligence toward the
transmission chain of the encoded text. Both ways of approaching literary texts
stand in their own right, but to think they should or could be combined in one
description sufficient for all purposes IMHO causes unnecessary complication of
basic efforts in textual scholarship. I think that is the central shortcoming of
the TEI approach, not a false notion of textual hierarchies. It was always a
bad idea for an editor to think he could produce a definitive edition. (Why are
the editors of the digital Faust edition not present in this discussion? What
have they to say about the costs to be paid 'earlier or later', as HC has stated?
There is an omen that such editions were mostly first-class funerals for
classical authors; you have to pay now, and much more than usual, without
proportional benefit in the future. That is not to say 'Leave it' but 'Make it
as simple as possible', as DS has stated earlier in this thread.)

Third observation: Such discussions seem dominated by an irresistible drive
toward the implementation side of the problems, instead of getting to the bottom
of the theoretical problems underlying the different algorithmic approaches. BTW,
theoretical discussion does not exclude examples, and to give my two or three
cents from my own working area I will show you 3 snippets delivered by Google
Books for the search string "Leonce Zufall Lena Vorsehung" (with option date=19th
cent.) from a play by Georg Buechner (published posthumously). Probably human
readers will easily recognize the potential 'content objects' in a
TEI/XML hierarchy, or the potential atoms in TC's 'string pool' to be attached
to TC trees. It should not be very difficult to instruct or train a machine
(program) to do the same. But I would prefer to see it as a constellation of
dramatic figures in a performance surprisingly recognizing each other as prince
and princess, and the question is open who is saying "Zufall" and who
"Vorsehung", and maybe the author was intending a certain overlap in the
uttering of these
Kind regards, Herbert

1842 (ed. Karl Gutzkow)
... Leonce.Lena?Lena.Leonce?Leonce.Ei Lena,ich glaube, das war die Flucht in das
Paradies. Ich bin betrogen.Lena.Ich bin betrogen. Leonce.O Zufall!Lena.O
Vorsehung!Valerio. Ich muß lachen. Eure Hoheiten sind wahrhaftig durch ...

1850 (ed. Ludwig Büchner)
... Lena?L e n a. Leonce?Leonce. Ei Lena,ich glaube, das war die Flucht in das
Paradies. Lena.Ich bin betrogen. Leonce.Ich bin betrogen. Lena.O Zufall! -
Leonce.O Vorsehung!Valerio. Ich muß lachen, ich muß lachen. Eure Hoheiten
sind ...

1879 (ed. Karl Emil Franzos)
...  Die Prinzessin! Leonce.Lena?Lena.Leonce?Leonce.Ei Lena,ich glaube, das war
die Flucht in das Paradies. - Lena.Ich bin betrogen. Leonce.Ich bin betrogen.
Lena.O Zufall!Leonce.O Vorsehung!Valerio. Ich muß lachen, ich muß lachen ..

        Date: 2020-06-18 17:53:13+00:00
        From: Hugh Cayless 
        Subject: Re: [Humanist] 34.117: annotating notation


I'm glad only one thing I said made you uneasy, and I share your principle
of avoiding overcomplication in system design. But what you have to say
interests me:

> However, for web applications at least I think schemas are not needed
> because invalid inputs can be dealt with successfully within the
> application.

This means to me that you prefer that constraints be discovered later on,
presumably in the form of bugs, and that those constraints be added (fixed)
at processing time, rather than at creation time. Your preference is to pay

But simplicity is its own constraint, surely. It seems obvious that one can
achieve a complex representation using only simple techniques, as long as
the simple techniques are composable. But if you work without controls on
your inputs I'd have thought that invites accidental complexity. I suspect
you mean to handle that by insisting on very limited formats as inputs,
putting the complexity (or intelligence) in the application layer rather
than the data layer.

Applications are brittle things, however. My own preference is to put
"intelligence" in the data rather than in the code, where possible, in part
because I assume the data are going to last longer than the application,
but also because (as a programmer) I expect code (my own not excluded) to
be sloppy, broken, and riddled with errors. The structure of data seems to
me easier to (dare I say) validate. On the other hand, "data is code" :-).
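
To make the contrast concrete, here is a small sketch of checking a record at creation time, assuming a hand-written check in place of a real schema language. The record shape (a named character range) and all function names are hypothetical.

```python
def validate_annotation(record, text_length):
    """Return a list of problems; an empty list means the record passes."""
    problems = []
    for key, expected in (("name", str), ("start", int), ("end", int)):
        if not isinstance(record.get(key), expected):
            problems.append(f"{key}: expected {expected.__name__}")
    # Only check the range once all fields have the right type.
    if not problems and not (0 <= record["start"] <= record["end"] <= text_length):
        problems.append("range lies outside the text")
    return problems


def store(record, text_length, database):
    """Creation-time checking: reject bad data before it is saved."""
    problems = validate_annotation(record, text_length)
    if problems:
        raise ValueError("; ".join(problems))
    database.append(record)


database = []
store({"name": "line", "start": 0, "end": 5}, 10, database)

try:
    store({"name": "line", "start": 8, "end": 3}, 10, database)
except ValueError as err:
    rejected = str(err)
    print("rejected:", rejected)
```

Without the check in `store`, the second record would be saved as it is, and every later piece of code that reads the data would have to cope with a range whose end precedes its start.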

Maybe this is a useful topic for discussion?

All the best,

>         Date: 2020-06-16 09:29:56+00:00
>         From: Desmond  Schmidt 
>         Subject: Re: [Humanist] 34.113: annotating notation
> Hugh,
> I don't necessarily disagree with what you say below. However, for web
> applications at least I think schemas are not needed because invalid
> inputs can be dealt with successfully within the application.
> The only thing I feel uneasy about is when you say:
> "I would argue that schemas allow you to sustain a higher level of
> complexity than you would be able to otherwise."
> This suggests to me that schemas may be used as a justification for
> introducing greater complexity than is really needed. Whereas I would
> suggest that a divide and conquer approach, for example, separating
> metadata or annotation or variation from the text being encoded may
> result in a simpler overall data model. Since you have been working
> with external annotation recently (Recogito) perhaps you can see the
> advantage of this type of approach. There is only so much complexity
> that people can take, or upon which a successful user interface can be
> built. If we believe William of Occam: "Entities are not to be
> multiplied beyond necessity." This sounds a bit like an admonition to
> the TEI.
> Desmond

        Date: 2020-06-18 06:30:57+00:00
        From: Iian Neill 
        Subject: Re: [Humanist] 34.121: annotating notation

Mr Sperberg-McQueen,

I have been working on a standoff property edition of the letters of
Michelangelo and the Florentine Diary of Luca Landucci in a system called
"Codex", which uses the SPEEDy standoff property text editor in conjunction
with the Neo4j graph database. In these editions stylistic, layout,
semantic and syntactic annotations are used - including TEI - and they can
all overlap freely. Pages, lines, and soft hyphens are also modelled and
intersect freely without any conflict. This system is currently being used
on the Hildegard von Bingen Letters Project at the Johannes-Gutenberg
Institute at Mainz.

I am happy to provide examples of the standoff property text JSON, of the
text-as-a-graph models, and the interfaces that are used to manage the text
and its relations.

I should mention that the editor allows for the text to be changed in
real-time without invalidating the standoff property character indexes.
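
For readers unfamiliar with the problem, one simple way of keeping character ranges valid across an edit is to shift them whenever text is inserted. The sketch below assumes (name, start, end) triples with an exclusive end offset; it illustrates the general technique only and is not a description of how SPEEDy itself works.

```python
def insert_text(text, properties, position, new_text):
    """Insert new_text at position and adjust the standoff ranges to match."""
    shift = len(new_text)
    updated = []
    for name, start, end in properties:
        if position <= start:
            # Insertion at or before the range: move the whole range along.
            start, end = start + shift, end + shift
        elif position < end:
            # Insertion inside the range: the range grows to include it.
            end += shift
        # Insertion at or after the end leaves the range untouched.
        updated.append((name, start, end))
    return text[:position] + new_text + text[position:], updated


text = "Dear brother"
properties = [("salutation", 0, 4), ("kin", 5, 12)]

new_text, new_properties = insert_text(text, properties, 4, "est")
for name, start, end in new_properties:
    print(name, repr(new_text[start:end]))
```

Whether an insertion exactly at the end of a range should extend that range is a policy decision; this sketch chooses not to extend it.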

Best regards,


Editor: Willard McCarty (King's College London, U.K.; Western Sydney University, Australia)
Software designer: Malgosia Askanas (Mind-Crafts)

This site is maintained under a service level agreement by King's Digital Lab.