              Humanist Discussion Group, Vol. 35, No. 77.
        Department of Digital Humanities, University of Cologne
                      Hosted by DH-Cologne
                       www.dhhumanist.org
                Submit to: humanist@dhhumanist.org


    [1]    From: Tim Smithers <tim.smithers@cantab.net>
           Subject: Re: [Humanist] 35.67: obsolescence of markup? (66)

    [2]    From: Martin Wynne <martin.wynne@bodleian.ox.ac.uk>
           Subject: Re: [Humanist] 35.74: obsolescence of markup (48)

    [3]    From: Manfred Thaller <manfred.thaller@uni-koeln.de>
           Subject: Re: [Humanist] 35.74: obsolescence of markup (152)

    [4]    From: Neven Jovanović <filologanoga@gmail.com>
           Subject: Re: [Humanist] 35.74: obsolescence of markup (36)


--[1]------------------------------------------------------------------------
        Date: 2021-06-09 17:06:34+00:00
        From: Tim Smithers <tim.smithers@cantab.net>
        Subject: Re: [Humanist] 35.67: obsolescence of markup?

Dear Willard,

I think of markup as embedded rendering commands.  Things like
TeX/LaTeX, or HTML. Will this kind of markup become obsolete?
Yes, when the corresponding rendering engines change, or when
we want to render the same content in a different way, beyond
what the current rendering engines can do.
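
A minimal illustration of what I mean, using an invented HTML-style
fragment of my own rather than anything from a real document:

   <p>Markup travels <em>inside</em> the content it renders.</p>

The instruction to italicise lives in the content itself, so when the
rendering engine changes, or we want a different rendering, it is exactly
this embedded instruction that goes stale.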

Adding something to content about what (we humans think) that
content is, or is about, I would say is a kind of annotation,
which you may or may not want to render.  If you do, the question of
whether it may become obsolete has the same answer.

Formalising this kind of annotation, I would say, is a matter
of knowledge representation, which, presumably, you want to do
to support some kind of automated reasoning about the
annotated content.  I would not call these annotations markup,
but, similarly, when the reasoning engine changes, your knowledge
representation annotations will become obsolete.

We now have plenty of examples of both these kinds of
obsolescence.  Still, we can, and sometimes do, build
emulators with which to re-animate some obsolete rendering
markup, or some obsolete knowledge representation.  (More
often we do this emulator building to re-animate old, but once
popular, computer games.)

As long as we still have computation (a la Turing/Church/Post)
and computers to make it happen, we'll always have a way back.
If the world moves over to Quantum Computing, maybe not.
Then it'll be up to Maxwell's daemon, I guess :-)

Best regards,

Tim


> On 04 Jun 2021, at 11:53, Humanist <humanist@dhhumanist.org> wrote:
>
>                  Humanist Discussion Group, Vol. 35, No. 67.
>        Department of Digital Humanities, University of Cologne
>                               Hosted by DH-Cologne
>                       www.dhhumanist.org
>                Submit to: humanist@dhhumanist.org
>
>
>
>
>        Date: 2021-06-04 07:07:11+00:00
>        From: Willard McCarty <willard.mccarty@mccarty.org.uk>
>        Subject: obsolescence of markup?
>
> Currently (correct me if I am wrong) markup intervenes to embed human
> intelligence about an object where artificial processes of detection and
> analysis fall short. Does this not suggest that some kinds of markup will
> become obsolete at some point? (I do not have in mind scholarly
> commentary!) Has anyone speculated intelligently along these
> lines?
>
> Yours,
> WM
> --
> Willard McCarty,
> Professor emeritus, King's College London;
> Editor, Interdisciplinary Science Reviews;  Humanist
> www.mccarty.org.uk

--[2]------------------------------------------------------------------------
        Date: 2021-06-09 09:40:06+00:00
        From: Martin Wynne <martin.wynne@bodleian.ox.ac.uk>
        Subject: Re: [Humanist] 35.74: obsolescence of markup

Dear Herbert,

This is one of the more unusual methods that I have seen to report an
error in a text, but it is in any case gratefully received, and the
error has been corrected in XML, HTML and plain text versions of the
text in the OTA, to be found at http://hdl.handle.net/20.500.12024/3048.

I'm not sure where you get the idea that the file was "multiply checked
in advancing from version to version". It's a nice idea, and if anyone
is interested in funding an initiative to manually check the 60,000 plus
texts that we curate pro bono for the benefit of the community, please
get in touch.

Yours,
Martin Wynne
Oxford Text Archive

On 09/06/2021 07:16, Humanist wrote:
>          Date: 2021-06-08 13:16:55+00:00
>          From: Dr. Herbert Wender<drwender@aol.com>
>          Subject: Re: [Humanist] 35.71: obsolescence of markup
>
> Dear Jonah,
>
> Before going up into the higher spheres of scholarly 'intuition', I would
> propose that we look at the state of the art in one of the most important
> markup places, in a world that is not always the best of all possible
> worlds. I thought we wouldn't need AI, just simple-minded algorithms, to
> clean up the markup in classical electronic editions. My example: "Lord
> Jim" by Joseph Conrad. The institution curating the XML file: the Oxford
> Text Archive. The original encoder: Michael Sperberg-McQueen. The Google
> test:
>
> Lord Jim Conrad, Joseph, 1857-1924 University of Oxford Text ...
> https://ota.bodleian.ox.ac.uk › xmlui › bitstream › handle ... and suddenly,
> giving up the idea of going home, he took a berth as chief mate of the
> rend=italic>Patna . The Patna was a local steamer as old as the hills,
> lean ...
>
> It's not a question of intuition, it's just a mistake that I wouldn't have
> expected in a file which was - so says the TEI header - multiply checked
> in advancing from version to version.
>
> Regards, Herbert
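
[Editorial note: the leaked "rend=italic>" in the search snippet above
suggests that the underlying TEI reads something like the following - a
guess at the intended encoding, not a quotation of the OTA file:

   ... he took a berth as chief mate of the <hi rend="italic">Patna</hi>.
   The <hi rend="italic">Patna</hi> was a local steamer as old as the
   hills, lean ...

and that whatever produced the plain-text derivative kept the tail of the
opening tag while stripping the rest.]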

--[3]------------------------------------------------------------------------
        Date: 2021-06-09 09:13:49+00:00
        From: Manfred Thaller <manfred.thaller@uni-koeln.de>
        Subject: Re: [Humanist] 35.74: obsolescence of markup

Dear Willard,

you raise a very interesting point with the following sentence ...

On 09.06.2021 at 08:16, Humanist wrote:
> Thus an example: metatext that says "this is a paragraph" versus
> metatext that comments on the author's likely intention in breaking the
> flow of prose in the particular version in question.
particularly in the context of your further remark:

> Is there yet another argument here for standoff markup?

A very interesting question, but one I am a bit reluctant to take up, as
it is prone to re-raise a dispute about the principles of embedded
markup, which during the last decades became almost ritualistic in my
opinion.

The underlying question is simply: Is there a difference between the
representation of a text and its interpretation? One of the oldest
inventions of the editorial disciplines has been the separation of the
apparatus criticus and the "Sachkommentar" (the explanatory commentary),
acknowledging that the two *should* be the result of different and, above
all, independent intellectual processes.

Michael Sperberg-McQueen's answer has always been: ALL we can say about
a text is so uncertain that that difference is meaningless. A minority,
including me, have always held that this difference is the foundation of
any kind of historical analysis, even if, being human, our subconscious
may find it difficult to separate the two as cleanly as we would like on
the conscious level.

Forgive me an argument that may look convoluted before it arrives at
your question on the implications of the rise of algorithmic
"interpretations".

Forgive me, furthermore, if I speak of "sources" in the following,
rather than of texts. I support the notion that any document contains a
lot of non-textual information, so the implicit claim that a "text" in
the more general sense can be represented meaningfully as a series of
Unicode characters - or any other finite set of glyphs - seems a bit
shallow to me. (Though it was of course acceptable, until we could do
better, as in the era of lead type.)

Before the use of high-level OCR software and entity extraction
algorithms, we have the following levels to represent:

Level alpha: A series of codes implementing the best of possible
representations of a source.

Level omega: Some interpretations by an intentional intellectual process.

or rather:

Level alpha: A series of codes implementing the best of possible
representations of a source.

Level omega_Smith: The interpretations of the source by Smith.

Level omega_Miller: The interpretations of the source by Miller.

If you take it seriously that both Smith and Miller can operate on an
agreed-upon representation yet disagree on the interpretation, I honestly
cannot see how you can combine the two omega levels in one embedded
markup system - or support changes on one level without endangering the
integrity of another. And the number of omega interpretations is of
course n, not two.

Your question regarding advanced systems of feature / entity extraction
implies that the situation becomes slightly more complicated:

Level alpha: Some series of codes implementing the best of possible
non-semantic representations of a source.

Level beta_InHonorOfTuring: Some series of feature / entity descriptions
proposed by algorithm "Turing".

Level beta_InHonorOfShannon: Some series of feature / entity
descriptions proposed by algorithm "Shannon".

Level omega_Smith: The interpretations of the source by Smith, partially
based on the results of "Turing".

Level omega_Miller: The interpretations of the source by Miller,
partially based on the results of "Shannon".

Sorry if I do NOT speak about AI here - in my opinion most feature /
entity extraction algorithms today are quite sophisticated rule-based
systems, but nothing even remotely autonomous. ("Most" meaning those
whose actual operation I feel comfortable saying anything about.)

In any case, I do not really understand how you could use embedded
markup for such a stack, particularly if it is supposed to be dynamic ...
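
For readers who would like to see the shape of the alternative, here is a
deliberately naive standoff sketch - the element names and offsets are
invented for illustration, not TEI and not any existing tool: one alpha
file that everyone agrees on, and each omega layer in its own file,
pointing into alpha by character position.

   <!-- alpha.xml: the agreed-upon representation, never touched by interpreters -->
   <text xml:id="alpha">The Patna was a local steamer as old as the hills.</text>

   <!-- omega_smith.xml: Smith's interpretation, kept entirely outside alpha -->
   <!-- offsets are character positions into alpha, end-exclusive -->
   <annotations resp="Smith" target="alpha.xml#alpha">
     <span from="4" to="9" note="the italics mark the name of a ship"/>
   </annotations>

   <!-- omega_miller.xml: Miller's competing reading of the very same span -->
   <annotations resp="Miller" target="alpha.xml#alpha">
     <span from="4" to="9" note="the italics mark a foreign word, not a name"/>
   </annotations>

Each such layer can change, multiply to n, or be discarded without
touching alpha or any other layer; doing the same with n sets of embedded
tags inside one file is exactly where the integrity problem described
above begins. A beta layer produced by "Turing" or "Shannon" would simply
be one more file of the same kind.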

For readers appreciative of polemical statements, I attach an excerpt
from a paper of mine dealing with part of that argument.

Yours,
Manfred

Appendix:

The underlying assumption that all texts can be transcribed as
well-understood, standardized characters I called unrealistic in [Thaller
2020], and indeed already in [Thaller 1993, 268-269] - unless we assume
that historians totally and completely understand the sources they
encounter. Such a complete understanding may seem natural to linguists or
computer scientists, or even to philologists who live in the clean world
of standardized printed editions; for a historian, such an assumption is,
in my opinion, simply unacceptable.

This becomes more serious the more committed we are to our starting
principle that there is a difference between representation and
interpretation. As [Coombs 1987, 935] has argued, there is a level of
handling texts at which punctuation and markup are simply the same. I
agree, but I must point out that there is a difference between a
punctuation mark which is in the transmitted text (where the punctuation
mark is a device by the original author to express a meaning we may or
may not know about) and a punctuation mark which is added by an editor
(by which the editors indicate their interpretation of the intentions of
the author). When we want to clearly separate the traces left by the
author from our interpretation of those traces, we cannot, therefore,
simply replace a strange wiggle of the pen by a standardized punctuation
mark. Nor should we replace a spurious semi-alphabetic abbreviation by a
transcription during the representation stage. A clean separation of
representation and interpretation therefore requires the possibility of
transferring a source for processing into strings which are mixtures of
standard characters, graphics of strange wiggles of the pen, and names of
glyphs which are recognizably standardized.
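
A loose sketch of what such a mixed string could look like - the element
names here are invented for illustration (TEI's gaiji mechanism offers a
standardized way of recording named glyphs):

   <transcription>
     co                                     <!-- standard characters -->
     <glyph name="latin-small-r-rotunda"/>  <!-- a recognizably standardized glyph, recorded by name -->
     pus                                    <!-- more standard characters: "corpus", with r rotunda -->
     <graphic url="wiggle-0017.png"/>       <!-- a strange wiggle of the pen, kept as an image -->
   </transcription>

Whether the wiggle is a punctuation mark, an abbreviation stroke or
something else entirely is then decided - and recorded - in an
interpretation layer, not in the representation.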

That for *every* character to be represented there is a residue of doubt,
whether it is, e.g., an “f” or an “s”, is of course true (as every reader
of older German fonts is very aware). The fact that we cannot provide a
*perfectly* neutral representation without *any* influence of
interpretative opinions should not prevent us from trying our best,
however. And, most of all, we should not let ourselves be prevented from
doing our best by allegedly unchangeable implications of information
technology. To quote Clifford Geertz: “I have never been impressed by the
argument that, as complete objectivity is impossible in these matters
(as, of course, it is), one might as well let one’s sentiment run loose.
As Robert Solow has remarked, that is like saying that as a perfectly
aseptic environment is impossible, one might as well conduct surgery in a
sewer.” [Geertz 1973, 30]

That is a quote from my "blog post" "On annotations and markup,
interpretations, and representation", available in printable form at
https://www.researchgate.net/publication/350706311_On_annotations_and_markup_interpretations_and_representation

The majority of Humanist readers should NOT read it: 97.5% of it deals
with hard-core software engineering.

--[4]------------------------------------------------------------------------
        Date: 2021-06-09 08:18:50+00:00
        From: Neven Jovanović <filologanoga@gmail.com>
        Subject: Re: [Humanist] 35.74: obsolescence of markup

Dear Willard,

my experience of curating a mixed collection of texts with markup
suggests that, ideally, a collection which we want to prepare for
reuse (in unexpected contexts) should offer at least four versions of
its texts, along the spectrum from "unannotated" to "fully annotated":
1. plain text without any markup (except perhaps newlines for paragraphs)
2. texts with basic markup of structure (headings, divisions, paragraphs)
3. texts segmented into sentences and words (documenting the issues of
"what counts as a sentence" and "what counts as a word")
4. texts with all editorial annotations (clearly marked as editorial;
in my collection, they annotate different aspects from text to text)
All versions should be accompanied by sufficient metadata to identify
provenance, authors, and works, referring, wherever possible, to
reasonably permanent external information collections such as VIAF or
Wikidata.
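
To make the four levels tangible, here is a toy sketch of a single
sentence at each of them; the markup is invented for illustration and is
not the actual encoding of my collection:

   <!-- 1. plain text, no markup -->
   The Patna was a local steamer.

   <!-- 2. basic structural markup -->
   <div><p>The Patna was a local steamer.</p></div>

   <!-- 3. segmented into sentences and words -->
   <s><w>The</w> <w>Patna</w> <w>was</w> <w>a</w> <w>local</w>
   <w>steamer</w><pc>.</pc></s>

   <!-- 4. with editorial annotation, clearly marked as editorial -->
   <s><w>The</w> <name type="ship" resp="#editor">Patna</name> <w>was</w>
   <w>a</w> <w>local</w> <w>steamer</w><pc>.</pc></s>

Each leaner version can in principle be derived from the richer ones,
which is one reason to keep the levels distinct rather than forcing every
user through the full annotation.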

Thus, in a way, reusability implies obsolescence. A lot of NLP
experiments routinely strip off or disregard carefully and
thoughtfully prepared editorial annotations; as a researcher, I
routinely ignore annotations which are not relevant to my current
approach to the text.

And yet, what do I do when I do not ignore an annotation, but disagree
with it? I do not know of a digital edition where the annotations of one
editor dispute those of another. Do we want such editions? Do we need
them?

(Now I see why you so often end your messages with questions.)

Best,

Neven

Neven Jovanović, Zagreb



_______________________________________________
Unsubscribe at: http://dhhumanist.org/Restricted
List posts to: humanist@dhhumanist.org
List info and archives at: http://dhhumanist.org
Listmember interface at: http://dhhumanist.org/Restricted/
Subscribe at: http://dhhumanist.org/membership_form.php