
Humanist Discussion Group



Humanist Archives: March 19, 2019, 5:37 a.m. Humanist 32.560 - illusions of progress

                  Humanist Discussion Group, Vol. 32, No. 560.
            Department of Digital Humanities, King's College London
                   Hosted by King's Digital Lab
                       www.dhhumanist.org
                Submit to: humanist@dhhumanist.org


    [1]    From: desmond.allan.schmidt@gmail.com
           Subject: Re: [Humanist] 32.553: illusions of progress (307)

    [2]    From: desmond.allan.schmidt@gmail.com
           Subject: Re: [Humanist] 32.553: illusions of progress (56)

    [3]    From: Jim Rovira 
           Subject: Re: [Humanist] 32.553: illusions of progress (33)


--[1]------------------------------------------------------------------------
        Date: 2019-03-17 19:58:56+00:00
        From: desmond.allan.schmidt@gmail.com
        Subject: Re: [Humanist] 32.553: illusions of progress

Elisa,

I couldn't read your collation output. A few observations on the main text:

> Indeed, a crucial and intensive phase of our work is "spinal
> adjustment" on the part of the scholarly editor who must correct the
> sometimes snagged and misaligned output of the collation software. (I'd
> contend that if you aren't working carefully over this on a Variorum,
> you're probably not doing your text-scholarly homework, and your Variorum
> won't be as meaningfully readable--you might make it fast, but it might not
> be reliable in its alignments.)
...
> The manuscript
> witness does not match the other editions for quite some time in this
> stretch of text, but at this moment, I begin to see semantic parallel, just
> on the cusp of the next passage where all the witnesses align around the
> phrase that starts the next new sentence in all editions:

> I can't rely on computer systems to deliver this particular alignment the way
> it should be, though they aligned what they could.

In these passages you admit that the alignment contains many mistakes.
And yet you complained when I told you that deletions were not
collatable in the same text as additions and substitutions. I know
what you are trying to do. It is what the Huygens Institute people
already tried with CollateX. They are still trying. Its designer,
Ronald Dekker, I heard, has already given up on this particular task
because he regarded it as too hard - and I agree with him, although I
would rather say it is impossible. If you're happy with misalignments
and errors that you have to fix manually after the collation engine
has done its work, you are welcome to that work. If you ever have to
change the transcriptions on which the collation is built, you will
have to re-run the collation and do the manual work all over again. I do not
want to ask my users to do that, even if they are meticulous textual
editors. A collation algorithm should simply work. You should just
press a button and a usable result should pop out. That is why we have
computers to do all the hard work for us.

Let me briefly explain the basic algorithms inside CollateX and
NMerge, my program which is incorporated in the Charles Harpur
Critical Archive and its edition platform Ecdosis.

I understand CollateX uses an algorithm based on Eugene Myers' 1986
diff algorithm. Doubtless Joris, if he is reading this, will correct
any inaccuracies here. This algorithm is not as good as the earlier
one by Esko Ukkonen, which Myers did not know of when he wrote it, but
both are based on a diagonalisation technique that uses the familiar
edit square. I recommend you look at Myers' article. It has a
wonderful illustration of the "snakes", or passages where both texts
align. From the end of each snake the algorithm branches out, probing
the mismatches until it finds another snake. The connected snakes then
form paths from the start to the end, and the first path to reach the
end wins. That's the best alignment.
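
The snake-following search described above can be sketched in Python
as follows. This is a minimal illustration of the greedy O(ND) idea
from Myers' paper, assuming only insertions and deletions as edit
operations; it is not CollateX's actual code, and the function name
myers_edit_distance is a hypothetical one chosen for this example. It
returns only the number of edits on the winning path, not the path
itself.

```python
def myers_edit_distance(a, b):
    """Length of the shortest edit script (insertions plus deletions)
    that turns sequence a into sequence b, found greedily."""
    n, m = len(a), len(b)
    max_d = n + m
    # v[k] is the furthest x reached so far on diagonal k, where k = x - y.
    v = {1: 0}
    for d in range(max_d + 1):            # d = number of edits spent so far
        for k in range(-d, d + 1, 2):     # diagonals reachable with d edits
            if k == -d or (k != d and v[k - 1] < v[k + 1]):
                x = v[k + 1]              # step down: take an item from b
            else:
                x = v[k - 1] + 1          # step right: drop an item from a
            y = x - k
            # Follow the "snake": slide along the diagonal while items match.
            while x < n and y < m and a[x] == b[y]:
                x += 1
                y += 1
            v[k] = x
            if x >= n and y >= m:         # first path to reach the end wins
                return d
    return max_d


# Works on strings or on lists of word tokens alike.
print(myers_edit_distance("ABCABBA", "CBABAC"))
print(myers_edit_distance("the quick brown fox".split(),
                          "the brown fox jumps".split()))
```

Because the search stops at the first path that reaches the bottom
right corner of the edit square, the sketch has no notion of which
matches are meaningful: a snake through "a" or "and" counts the same
as one through a distinctive phrase.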

My algorithm is based conceptually on several sequence alignment
programs. There are many similarities between sequence or protein
alignment and the collation problem: both have insertions, deletions,
variants and transpositions. It is also inspired by Julien
Bourdaillet's clever MEDITE program which he wrote for his PhD in
France. It was he who gave me the idea of using Ukkonen's suffix-tree
algorithm to do collation. Once you have your suffix tree you can
quickly look up any alignment between two or more texts and so solve
the problems you are describing above.
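
To illustrate the kind of lookup a suffix index makes possible, here
is a minimal Python sketch. It deliberately swaps in a naively built
sorted list of suffixes (a simple suffix array) for Ukkonen's suffix
tree, which would give the same answers far more efficiently; it is
not NMerge's or MEDITE's code. The names longest_match and
find_matches and the min_len threshold are assumptions made for this
example.

```python
from bisect import bisect_left


def common_prefix_len(s, t):
    """Number of leading characters that s and t share."""
    n = 0
    for x, y in zip(s, t):
        if x != y:
            break
        n += 1
    return n


def longest_match(suffixes, query):
    """Return (length, offset) of the longest prefix of query found in
    the indexed text. In a sorted list the best match is always one of
    the two neighbours of the query's insertion point."""
    pos = bisect_left(suffixes, (query,))
    best = (0, 0)
    for j in (pos - 1, pos):
        if 0 <= j < len(suffixes):
            suffix, offset = suffixes[j]
            length = common_prefix_len(suffix, query)
            if length > best[0]:
                best = (length, offset)
    return best


def find_matches(a, b, min_len=4):
    """Scan b left to right and report (offset_in_a, offset_in_b, length)
    for every shared passage of at least min_len characters."""
    # Naive index: every suffix of a, sorted, paired with its start offset.
    suffixes = sorted((a[i:], i) for i in range(len(a)))
    matches = []
    j = 0
    while j < len(b):
        length, offset = longest_match(suffixes, b[j:])
        if length >= min_len:
            matches.append((offset, j, length))
            j += length
        else:
            j += 1
    return matches


a = "natural philosophy is the genius"
b = "chemist natural philosophy"
print(find_matches(a, b, min_len=5))
```

The min_len threshold is one simple way to stop very short matches
from being treated as alignments; the index is built once and can then
be queried from any position in the other text.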

Let me just say this, since you described me before as being just "a
programmer": I am equally qualified in the humanities and in
Information Technology. I was formerly a classicist who worked on
papyri. I now work in eResearch, where I work mostly with scientists,
many of them biologists with similar problems to the ones we are
looking at here. In my spare time I work on the Charles Harpur edition
developing interactive user interfaces, and also on editing and
historical research. You can even read about the general digital
scholarly editing system we have developed on the ecdosis.rocks
website. It does not use XML.

- Ecdosis Technical design:
http://charles-harpur.org/About/Technical%20design/Overview/
- My posting on Ukkonen's Suffix tree:
http://programmerspatch.blogspot.com/2013/02/ukkonens-suffix-tree-algorithm.html
- How fast Can we diff two Strings?
http://digitalvariants.blogspot.com/2011/03/implementing-ukkonnens-diff-algorithm.html
- Ecdosis.rocks: putting historical documents on the Web: http://ecdosis.rocks
- J. Bourdaillet and J-G. Ganascia (2006) MEDITE: A Unilingual Textual
Aligner, https://link.springer.com/chapter/10.1007%2F11816508_46
- E. Myers (1986) An O(ND) Difference Algorithm and Its Variations,
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.4.6927

Desmond Schmidt
eResearch
Queensland University of Technology

On 3/16/19, Humanist  wrote:
>
> --[3]------------------------------------------------------------------------
>         Date: 2019-03-15 07:47:27+00:00
>         From: Elisa Beshero-Bondar 
>         Subject: Re: [Humanist] 32.549: Illusions of Progress & the price of
> manipulability
>
> Dear Humanists,
>
> I'm replying to remarks made recently about the Frankenstein Variorum, with
> some interesting code that may be worthy of continued discussion. Herbert
> has raised an issue about our critical apparatus, though he located not our
> apparatus code but rather some pre-collation code from the Frankenstein
> Variorum to feature on this list. I'd like to share a view of the
> manuscript notebook passage that Herbert selected in comparison with the
> other versions of Frankenstein at that moment. I think it is one of the
> more interesting portions of critical apparatus in the Variorum--one of the
> reasons I'm excited to be working on it, and I also think it will raise an
> issue about locating semantic comparison across hierarchical structures
> worth thinking about on this list.
>
> Fair warning: the code I'm sharing below contains a fair amount of
> flattened markup because we collate markup from our source edition files.
> Some markup is masked from collation when we've decided it isn't meaningful
> for semantic comparison (such as surface, zone, and line-beginnings). Some
> markup is preserved and processed in the collation, visible in the lists of
> tokens you'll see in the @n on rdgGrp. That attribute value gives you a
> glimpse of the string tokens that we're feeding to our collation software,
> collateX. We want some markup to be processed as significant
> distinction--that is key to our algorithm and our thinking about the
> Variorum and what we want to compare across five editions.
>
> Desmond has suggested that it must be impossible for me to collate
> deletions, but I am happy to say that is not so--we're quite successful
> with it, and indeed this extends beyond the S-GA code to incorporate
> insertions and deletions in the marginalia of the Thomas copy, a print
> edition of the 1818 Frankenstein in which MWS added marginal insertions,
> deletions, and annotations about plans to change the novel. Being able to
> see the deleted passages in the critical apparatus sometimes (as in this
> case) shows us something we might not have noticed before that aligns the
> concepts of early and later versions. By the way, Desmond, I must correct
> you: we have not one, but *two* editions in the Variorum critical apparatus
> in which we collate manuscript deletion markup. You may protest that the
> code is too difficult for humans to read, but do keep in mind that much of
> the code is computer generated and designed for parsing and
> construction--we made it for parsing, and it's the backend foundation of
> what will be an interface. We think of this code as a "spine" or spinal
> column for the Variorum, because we build up the edition from its
> foundation. Indeed, a crucial and intensive phase of our work is "spinal
> adjustment" on the part of the scholarly editor who must correct the
> sometimes snagged and misaligned output of the collation software. (I'd
> contend that if you aren't working carefully over this on a Variorum,
> you're probably not doing your text-scholarly homework, and your Variorum
> won't be as meaningfully readable--you might make it fast, but it might not
> be reliable in its alignments.)
>
> Don't blame the collation of XML markup for the time it takes to review and
> revise collation algorithms and outputs following the Gothenburg Model.
> Adjusting the spine is not a problem caused by collating markup and it's
> certainly not a problem with the edition hierarchies when they're pretty
> well flattened here into the critical apparatus structure. No, rather, our
> problem is a ubiquitous one of coping with text strings (sometimes with
> angle brackets in them) that get snagged up on false positives like "a",
> "was", "and" surfacing like a utopian island of harmony in the midst of
> significant divergences. Once we've corrected the alignments (starting with
> automated methods, facilitated by some Schematron I write), we use
> automated methods to flatten, raise, and manipulate the code you see below
> into output editions stamped with the encoding of data about variation for
> each passage portioned in a critical apparatus. I believe I shared our Zen
> Garden of Flatten and Raise slides in a previous post, but here they are
> again: https://slides.com/elisabeshero-bondar/zenraising#/ , and the paper
> connected with them:
>
> https://www.balisage.net/Proceedings/vol21/html/Birnbaum01/BalisageVol21-Birnbaum01.html
>
> Here is the passage about which Herbert raised a question. It is a passage
> that has no exact counterpart in the later editions, so I have taken some
> pains with it to see where I think it best aligns. This is the role of the
> text scholar, to make a decision about meaningful alignment. The manuscript
> witness does not match the other editions for quite some time in this
> stretch of text, but at this moment, I begin to see semantic parallel, just
> on the cusp of the next passage where all the witnesses align around the
> phrase that starts the next new sentence in all editions: "Natural
> philosophy". You'll almost certainly find the simpler view of the collation
> token list easier to read than the markup, and indeed that's why I
> cultivate that view for our team.
>
>  {app}
>    {rdgGrp n="['chapt.', '2', 'those', 'events', 'which', 'materially',
> 'influence', 'our', 'future', 'destinies', '<del>are<del>',
> 'often', '<del>caused<del><del>by', 'slight',
> 'or<del>derive', 'thier', 'origin', 'from', 'a', 'trivial',
> 'occurence<del>s<del>.', '<del>strange', 'as',
> 'the<del><del>statement', 'of',
> 'the<del><del>simple', 'fact<del><del>may',
> 'appear', 'my', 'fate', 'had', 'been<del>']"}
>    {rdg wit="fMS"}<gap quantity="22" reason="resequencing"
> unit="tokens"/><gap quantity="4" reason="resequencing"
> unit="lines"/><lb n="c56-0005__main__1"/><milestone
> spanTo="#c56-0005.06" unit="tei:head"/>Chapt. 2<lb
> n="c56-0005__main__2"/> Those events which materially influence our
> fu<lb n="c56-0005__main__3"/>ture destinies <del
> rend="strikethrough" sID="c56-0005__main__d2e949"/>are<del
> eID="c56-0005__main__d2e949"/> often <del rend="strikethrough"
> sID="c56-0005__main__d2e954"/>caused<del
> eID="c56-0005__main__d2e954"/><del rend="strikethrough"
> sID="c56-0005__main__d2e957"/>by slight or<del
> eID="c56-0005__main__d2e957"/>derive thier origin from atri<lb
> n="c56-0005__main__4"/>vial occurence<del rend="strikethrough"
> sID="c56-0005__main__d2e971"/>s<del
> eID="c56-0005__main__d2e971"/>. <del next="#c56-0005.02"
> rend="strikethrough" sID="c56-0005__main__d2e974"/>Strange as the<del
> eID="c56-0005__main__d2e974"/><metamark
> function="insert">^</metamark><del rend="strikethrough"
> sID="c56-0005__main__d2e985"/>statement of the<del
> eID="c56-0005__main__d2e985"/><del next="#c56-0005.03"
> rend="strikethrough" sID="c56-0005__main__d2e992"
> xml:id="c56-0005.02"/>simple fact<del
> eID="c56-0005__main__d2e992"/><lb n="c56-0005__main__5"/><del
> rend="strikethrough" sID="c56-0005__main__d2e998"
> xml:id="c56-0005.03"/>may appear my fate had been<del
> eID="c56-0005__main__d2e998"/>{/rdg}
>     {/rdgGrp}
>    {rdgGrp n="['i', 'find', 'it', 'arise,', 'like', 'a', 'mountain',
> 'river,', 'from', 'ignoble', 'and', 'almost', 'forgotten', 'sources;',
> 'but,', 'swelling', 'as', 'it', 'proceeded,', 'it', 'became', 'the',
> 'torrent', 'which,', 'in', 'its', 'course,', 'has', 'swept', 'away', 'all',
> 'my', 'hopes', 'and', 'joys.<p/>']"}
>    {rdg wit="f1818"}I find it arise, like a mountain river, from ignoble
> and almost forgotten sources; but, swelling as it proceeded, it became the
> torrent which, in its course, has swept away all my hopes and joys.<p
> eID="novel1_letter4_chapter1_div4_div1_p14"/> {/rdg}
>    {rdg wit="f1823"}I find it arise, like a mountain river, from ignoble
> and almost forgotten sources; but, swelling as it proceeded, it became the
> torrent which, in its course, has swept away all my hopes and joys.<p
> eID="novel1_letter4_chapter1_div4_div1_p14"/> {/rdg}
>    {rdg wit="fThomas"}I find it arise, like a mountain river, from ignoble
> and almost forgotten sources; but, swelling as it proceeded, it became the
> torrent which, in its course, has swept away all my hopes and joys.<p
> eID="novel1_letter4_chapter1_div4_div1_p14"/> {/rdg}
>    {rdg wit="f1831"}I find it arise, like a mountain river, from ignoble
> and almost forgotten sources; but, swelling as it proceeded, it became the
> torrent which, in its course, has swept away all my hopes and joys.<p
> eID="novel1_letter4_chapter2_div4_div2_p6"/> {/rdg}
>     {/rdgGrp}
>   {/app}
>
>
> {app}
>    {rdgGrp n="['<del>chemist<del>natural', 'philosophy']"}
>    {rdg wit="fMS"}<del rend="strikethrough"
> sID="c56-0005__main__d2e1001"/>Chemist<del
> eID="c56-0005__main__d2e1001"/>Natu<lb n="c56-0005__main__6"/>ral
> philosophy {/rdg}
>     {/rdgGrp}
>    {rdgGrp n="['<p/>natural', 'philosophy']"}
>    {rdg wit="f1818"}<p
> sID="novel1_letter4_chapter1_div4_div1_p15"/>Natural philosophy {/rdg}
>    {rdg wit="f1823"}<p
> sID="novel1_letter4_chapter1_div4_div1_p15"/>Natural philosophy {/rdg}
>    {rdg wit="fThomas"}<p
> sID="novel1_letter4_chapter1_div4_div1_p15"/>Natural philosophy {/rdg}
>    {rdg wit="f1831"}<p
> sID="novel1_letter4_chapter2_div4_div2_p7"/>Natural philosophy {/rdg}
>     {/rdgGrp}
>   {/app}
>
>
> I found this passage in the manuscript of great interest to align because
> it stands out as meaningfully comparable, but not at all in the same
> language or structure as the other editions, though it leads to the same
> destination in Victor's primary study. The deletions in the MS gave me the
> signals of semantic alignment with the mountain-torrent passage in the
> print editions.
>
>  I'd like to foreground this passage in our discussion because it
> represents the old-fashioned work I have to do as a scholarly editor--I
> can't rely on computer systems to deliver this particular alignment the way
> it should be, though they aligned what they could. And yet this project is
> also full of schema validation and transformation
> pipelines--machine-assisted but human coded, too. It's not easy to read a
> manuscript edition as it compares with other editions, and we have not yet
> been able to read a clear view of passages like this in print editions with
> footnotes or margin notes that sprawl across pages. We're working to design
> a good, user-centered view of this, but I'm enjoying the intellectual work
> on the backend.
>
> Digital editions are as intricate and complicated as we made them, and I'm
> aware I've made decisions and developed algorithms that value
> complexity--probably these wouldn't be everyone's choices, but they're
> mine. (I remember when we were considering not collating deletions in the
> Variorum, and also how, when I inspected the code, I decided I really
> wanted to do it and took up the challenge.) Our algorithms and data models
> are what we require to build according to our plan. Perhaps digital
> editions are as distinctive as their editors' designs, and the scholarly
> editor works best when the data model is designed to suit the research
> questions motivating the edition.
>
>
> Cheers,
>
> Elisa
> --
> Elisa Beshero-Bondar, PhD
> Director: Center for the Digital Text
> https://www.greensburg.pitt.edu/digital-humanities/center-digital-text
> 
> Associate Professor of English
> University of Pittsburgh at Greensburg
> Humanities Division
> 150 Finoli Drive, Greensburg, PA  15601  USA
> E-mail: ebb8@pitt.edu



--[2]------------------------------------------------------------------------
        Date: 2019-03-16 15:05:13+00:00
        From: desmond.allan.schmidt@gmail.com
        Subject: Re: [Humanist] 32.553: illusions of progress

Martin,

where are the hierarchies in these "elaborate conventions of white
space and typographical marks (including letters and numbers) to
communicate a sense of structure to their readers."?

Printers worked by placing pieces of metal containing letter-shapes or
spacers within a rectangular frame. They did not group acts, scenes
and speeches within sub-frames which they sawed off at the bottom of
the page. If you removed all the "elaborate conventions" of typography
and spacing from the text the expressions "Act 1" or "Scene 3" would
still have communicated the structure of the play to the reader. There
is no intrinsic hierarchy in the non-electronic printers' medium.

And it is not true that "Any printed page is full of 'mark-up'".
Mark-up is "The process of embedding tags in an electronic text so as
to distinguish the text's logical, syntactic, or structural
components; the tags so embedded. (OED)" I do not see them on very
many printed pages or on any printed book of Shakespeare. The ancient
Greeks used mark-up on papyri to divide poetry into sections but they
did not use fonts and styles of type, which are NOT mark-up.

Desmond Schmidt
eResearch
Queensland University of Technology

On 3/15/19, Martin Mueller wrote:
>
>>  I can't help noticing that all
>>  the printers of Shakespeare until the advent of SGML had no need of
>>  hierarchies to produce books that everyone could read. They just put
>>  black marks on a page, not caring if the text was divided conceptually
>>  into acts, scenes, speeches and lines because every reader could
>>  already see it.
>
> Whatever printers did or didn't do in a print world, they didn't
> just "put black marks on a page".  They used elaborate conventions
> of white space and typographical marks (including letters and
> numbers) to communicate a sense of structure to their readers. Leave
> aside the question whether the structure was theirs or the author's,
> and leave aside the question whether they "got it right".  Any
> printed page is full of "mark-up" and most of the time the mark-up
> doesn't change from page to page. From my very brief career as a
> student of German literature I remember a lecture about Hoffmann's
> Tomcat Murr, which begins with a printer's apology. He had two print
> jobs: printing Murr's autobiography and the biography of the
> musician Kreisler. He left the windows open: a storm scattered the
> leaves and all the printer could do was to put together the pages as
> he found them.

> The charm of that fiction depends on a shared understanding of what
> printers ordinarily do. File this posting--and many of the others
> prompted by Desmond Schmidt's reflections--under Order and its
> Discontents.



--[3]------------------------------------------------------------------------
        Date: 2019-03-16 07:23:11+00:00
        From: Jim Rovira 
        Subject: Re: [Humanist] 32.553: illusions of progress

Frances:

Thanks very much for your response. I’m unsure how your discussion of
reading/text is a response to my discussion of convention/text. I think I should
have defined my terms: by “convention” I meant the physical presentation of
words on a page or screen, and by “text” I meant the words as parts of semantic
units.

For example, presenting lines of Shakespeare as continuous unnumbered lines of
prose follows one convention, while presenting them lineated and numbered with
different indentation levels is another. This is what I had in mind by different
editions of Shakespeare.

I should add here again that by this description every performance of
Shakespeare is a different edition of his text: different actors carrying out
different gestures, blocking, and intonation placed in different settings being
different conventions guiding the presentation of Shakespeare’s text. My point
is only that there is no text without the conventions that present it to us.
That would be a telepathically communicated book. This is to me a very
pedestrian idea.

Richard:

I thought that I did properly attribute your words — the quotation using the
word “corruption” — to you in my post. I used it as an example of a way of
thinking also applicable to Desmond’s post.

I did notice and should have acknowledged, though, your reticence with the
metaphor, so I apologize for that oversight.

Jim R






Editor: Willard McCarty (King's College London, U.K.; Western Sydney University, Australia)
Software designer: Malgosia Askanas (Mind-Crafts)

This site is maintained under a service level agreement by King's Digital Lab.