Humanist Discussion Group, Vol. 35, No. 77.
Department of Digital Humanities, University of Cologne
Hosted by DH-Cologne
www.dhhumanist.org
Submit to: humanist@dhhumanist.org


    [1]    From: Tim Smithers <tim.smithers@cantab.net>
           Subject: Re: [Humanist] 35.67: obsolescence of markup? (66)

    [2]    From: Martin Wynne <martin.wynne@bodleian.ox.ac.uk>
           Subject: Re: [Humanist] 35.74: obsolescence of markup (48)

    [3]    From: Manfred Thaller <manfred.thaller@uni-koeln.de>
           Subject: Re: [Humanist] 35.74: obsolescence of markup (152)

    [4]    From: Neven Jovanović <filologanoga@gmail.com>
           Subject: Re: [Humanist] 35.74: obsolescence of markup (36)


--[1]------------------------------------------------------------------------
        Date: 2021-06-09 17:06:34+00:00
        From: Tim Smithers <tim.smithers@cantab.net>
        Subject: Re: [Humanist] 35.67: obsolescence of markup?

Dear Willard,

I think of markup as embedded rendering commands: things like TeX/LaTeX, or HTML. Will this kind of markup become obsolete? Yes, when the corresponding rendering engines change, or when we want to render the same content in a different way, beyond what the current rendering engines can do.

Adding something to content about what (we humans think) that content is, or is about, I would say is a kind of annotation, which you may or may not want to render. If you do, the may-it-become-obsolete question has the same answer.

Formalising this kind of annotation is, I would say, a matter of knowledge representation, which, presumably, you want to do to support some kind of automated reasoning about the annotated content. I would not call these annotations markup, but, similarly, when the reasoning engine changes, your knowledge representation annotations will become obsolete.

We now have plenty of examples of both these kinds of obsolescence. Still, we can, and sometimes do, build emulators with which to re-animate some obsolete rendering markup, or some obsolete knowledge representation.
(More often we do this emulator building to re-animate old, but once popular, computer games.)

As long as we still have computation (a la Turing/Church/Post) and computers to make it happen, we'll always have a way back. If the world moves over to Quantum Computing, maybe not. Then it'll be up to Maxwell's daemon, I guess -:)

Best regards,

Tim

> On 04 Jun 2021, at 11:53, Humanist <humanist@dhhumanist.org> wrote:
>
> Humanist Discussion Group, Vol. 35, No. 67.
> Department of Digital Humanities, University of Cologne
> Hosted by DH-Cologne
> www.dhhumanist.org
> Submit to: humanist@dhhumanist.org
>
>
>         Date: 2021-06-04 07:07:11+00:00
>         From: Willard McCarty <willard.mccarty@mccarty.org.uk>
>         Subject: obsolescence of markup?
>
> Currently (correct me if I am wrong) markup intervenes to embed human
> intelligence about an object where artificial processes of detection and
> analysis fall short. Does this not suggest that some kinds of markup will
> become obsolete at some point? (I do not have in mind scholarly
> commentary!) Has anyone speculated intelligently along these
> lines?
>
> Yours,
> WM
> --
> Willard McCarty,
> Professor emeritus, King's College London;
> Editor, Interdisciplinary Science Reviews; Humanist
> www.mccarty.org.uk


--[2]------------------------------------------------------------------------
        Date: 2021-06-09 09:40:06+00:00
        From: Martin Wynne <martin.wynne@bodleian.ox.ac.uk>
        Subject: Re: [Humanist] 35.74: obsolescence of markup

Dear Herbert,

This is one of the more unusual methods that I have seen to report an error in a text, but it is in any case gratefully received, and the error has been corrected in the XML, HTML and plain text versions of the text in the OTA, to be found at http://hdl.handle.net/20.500.12024/3048.

I'm not sure where you get the idea that the file was "multiply checked in advancing from verrsion to verrsion".
It's a nice idea, and if anyone is interested in funding an initiative to manually check the 60,000-plus texts that we curate pro bono for the benefit of the community, please get in touch.

Yours,
Martin Wynne
Oxford Text Archive

On 09/06/2021 07:16, Humanist wrote:
>         Date: 2021-06-08 13:16:55+00:00
>         From: Dr. Herbert Wender <drwender@aol.com>
>         Subject: Re: [Humanist] 35.71: obsolescence of markup
>
> Dear Jonah,
>
> before going up into the higher spheres of scholarly 'intuition' I would propose
> that we look at the state of the art in one of the most important markup places
> in a world not always the best of all possibles. I thought we wouldn't need AI
> but just simple-minded algorithms to clean up the markup in classical electronic
> editions. My example: "Lord Jim" by Joseph Conrad. The institution curating the
> XML file: The Oxford Text Archive. The original encoder: Michael Sperberg-
> McQueen. The Google test:
>
> Lord Jim Conrad, Joseph, 1857-1924 University of Oxford Text ...
> https://ota.bodleian.ox.ac.uk › xmlui › bitstream › handle... and suddenly,
> giving up the idea of going home, he took a berth as chief mate of the
> rend=italic>Patna . The Patna was a local steamer as old as the hills, lean ...
>
> It's not a question of intuition, it's just a mistake that I wouldn't have
> expected in a file which was - so said by the TEI header - multiply checked in
> advancing from verrsion to verrsion.
>
> Regards, Herbert


--[3]------------------------------------------------------------------------
        Date: 2021-06-09 09:13:49+00:00
        From: Manfred Thaller <manfred.thaller@uni-koeln.de>
        Subject: Re: [Humanist] 35.74: obsolescence of markup

Dear Willard,

you raise a very interesting point with the following sentence ...

On 09.06.2021 at 08:16, Humanist wrote:
> Thus an example: metatext that says "this is a paragraph" versus
> metatext that comments on the author's likely intention in breaking the
> flow of prose in the particular version in question.
particularly in the context of your further remark:

> Is there yet another argument here for standoff markup?

A very interesting question, but one I am a bit reluctant to take up, as it is prone to re-raise a dispute about the principles of embedded markup which, during the last decades, has become almost ritualistic in my opinion. The underlying question is simply: is there a difference between the representation of a text and its interpretation? One of the oldest inventions of the editorial disciplines has been the separation of the apparatus criticus and the "Sachkommentar", acknowledging that the two <emph>should</emph> be the result of different and, above all, independent intellectual processes. The answer of Michael Sperberg-McQueen has always been: ALL we can say about a text is so uncertain that that difference is meaningless. A minority, including me, have always considered that this difference is the foundation of any kind of historical analysis, even if, being human, our subconscious may find it difficult to separate the two as much as we would like on the conscious level.

Forgive me an argument that may look convoluted before it arrives at your question on the implications of the rise of algorithmic "interpretations". Forgive me, furthermore, if I speak of "sources" in the following, rather than of texts. I support the notion that any document contains a lot of non-textual information, so the implicit claim that a "text" in the more general sense can be represented meaningfully as a series of Unicode characters - or any other finite set of glyphs - seems a bit shallow to me. (Though it was of course acceptable, until we could do better, as in the time of leaden fonts.)

Before the usage of high-level OCR software and entity extraction algorithms, we have the following levels to represent:

Level alpha: A series of codes implementing the best of possible representations of a source.

Level omega: Some interpretations by an intentional intellectual process.
or rather:

Level alpha: A series of codes implementing the best of possible representations of a source.

Level omega_Smith: The interpretations of the source by Smith.

Level omega_Miller: The interpretations of the source by Miller.

If you take seriously that both Smith and Miller can operate on an agreed-upon representation but disagree at the same time on the interpretation, I honestly cannot see how you can combine the two omega levels in one embedded markup system - or support changes on one level without endangering the integrity of another. And the number of omega interpretations is of course n, not two.

Your question regarding advanced systems of feature extraction / entity extraction would mean that the situation becomes slightly more complicated:

Level alpha: Some series of codes implementing the best of possible non-semantic representations of a source.

Level beta_InHonorOfTuring: Some series of feature / entity descriptions proposed by algorithm "Turing".

Level beta_InHonorOfShannon: Some series of feature / entity descriptions proposed by algorithm "Shannon".

Level omega_Smith: The interpretations of the source by Smith, partially based on the results of "Turing".

Level omega_Miller: The interpretations of the source by Miller, partially based on the results of "Shannon".

Sorry if I do NOT speak about AI here - in my opinion most feature extraction / entity extraction algorithms today are quite sophisticated rule-based systems, but nothing even remotely autonomous. ("Most" meaning those whose actual operation I feel comfortable saying anything about.) In any case, I do not really understand how you could use embedded markup for such a stack, particularly if it is supposed to be dynamic ...

For readers appreciative of polemic statements, I attach an excerpt of a paper of mine dealing with part of that argument.
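The layered stack described above is, in effect, an argument for standoff annotation: every level points into the base representation without modifying it. The following minimal Python sketch illustrates that idea under stated assumptions - the layer names, character offsets, and labels are invented for illustration only, and are not drawn from any real edition or encoding standard.

```python
# Minimal standoff-markup sketch (illustrative only): layer names,
# offsets, and labels are invented for this example.

# Level alpha: the agreed-upon base representation of the source.
alpha = "and suddenly, giving up the idea of going home, he took a berth"

def annotate(start, end, label):
    """An annotation points into alpha by character offsets; it never
    modifies the base string, so layers remain independent."""
    return {"start": start, "end": end, "label": label}

# Level beta: feature descriptions proposed by two different algorithms.
beta_turing  = [annotate(0, 3, "conjunction")]
beta_shannon = [annotate(4, 12, "adverb")]

# Level omega: two editors' interpretations, which may disagree freely
# without endangering each other's layer or the base representation.
omega_smith  = [annotate(14, 46, "authorial-digression")]
omega_miller = [annotate(14, 46, "narratorial-irony")]

def excerpt(annotation):
    """Recover the annotated span from the base representation."""
    return alpha[annotation["start"]:annotation["end"]]

# Both interpretations resolve against the same, untouched alpha level,
# even though their labels contradict each other.
assert excerpt(omega_smith[0]) == excerpt(omega_miller[0])
```

The point of the sketch is only that adding, removing, or revising any omega (or beta) layer is an operation on a separate object; the alpha level is never rewritten, which is exactly what embedded markup cannot guarantee.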
Yours,
Manfred

Appendix:

The underlying assumption that all texts can be transcribed as well-understood standardized characters I called unrealistic in [Thaller 2020], and indeed already in [Thaller 1993, 268-269] - unless we assume that historians totally and completely understand the sources they encounter. Such a complete understanding may seem natural to linguists or computer scientists, or even to philologists who live in the clean world of standardized printed editions; for a historian it is, in my opinion, simply unacceptable. This becomes the more serious, the more committed we are to our starting principle that there is a difference between representation and interpretation.

As [Coombs 1987, 935] has argued, there is a level of handling texts where punctuation and markup are simply the same. I agree, but I must point out that there is a difference between a punctuation mark which is in the transmitted text (where the punctuation mark is a device by the original author to express a meaning we may or may not know about) and a punctuation mark which is added by an editor (by which the editors indicate their interpretation of the intentions of the author). When we want to clearly separate the traces left by the author from our interpretation of those traces, we cannot, therefore, simply replace a strange wiggle of the pen by a standardized punctuation mark. Nor should we replace a spurious semi-alphabetic abbreviation by a transcription during the representation stage. A clean separation of representation and interpretation, therefore, requires the possibility to transfer a source for processing into strings which are mixtures of standard characters, graphics of strange wiggles of the pen, and names of glyphs which are recognizably standardized. That for /every/ character to be represented there is a residue of doubt whether it is, e.g., an "f" or an "s" is of course true (as every reader of older German fonts is very aware).
The fact that we cannot provide a /perfectly/ neutral representation without /any/ influence of interpretative opinions should not prevent us from trying our best, however. And most of all, we should not let ourselves be prevented from doing our best by allegedly unchangeable implications of information technology. To quote Clifford Geertz:

"/I have never been impressed by the argument that, as complete objectivity is impossible in these matters (as, of course, it is), one might as well let one's sentiment run loose. As Robert Solow has remarked, that is like saying that as a perfectly aseptic environment is impossible, one might as well conduct surgery in a sewer./" [Geertz 1973, 30]

That is a quote from my "blog post" "On annotations and markup, interpretations, and representation", in printable form at
https://www.researchgate.net/publication/350706311_On_annotations_and_markup_interpretations_and_representation

The majority of Humanist readers should NOT read it: 97.5% of it deals with hard-core software engineering.


--[4]------------------------------------------------------------------------
        Date: 2021-06-09 08:18:50+00:00
        From: Neven Jovanović <filologanoga@gmail.com>
        Subject: Re: [Humanist] 35.74: obsolescence of markup

Dear Willard,

my experience of curating a mixed collection of texts with markup suggests that, ideally, a collection which we want to prepare for reuse (in unexpected contexts) should offer at least four versions of its texts, along the spectrum from "unannotated" to "fully annotated":

1. plain text without any markup (except perhaps newlines for paragraphs)
2. texts with basic markup of structure (headings, divisions, paragraphs)
3. texts segmented into sentences and words (documenting the issues of "what counts as a sentence" and "what counts as a word")
4. texts with all editorial annotations (clearly marked as editorial; in my collection, they annotate different aspects from text to text)

All versions should be accompanied by sufficient metadata to identify provenance, authors and works, referring, wherever possible, to reasonably permanent external information collections such as VIAF or Wikidata.

Thus, in a way, reusability implies obsolescence. A lot of NLP experiments routinely strip off or disregard carefully and thoughtfully prepared editorial annotations; as a researcher, I routinely ignore annotations which are not relevant to my current approach to the text. And yet, what do I do when I do not ignore an annotation, but disagree with it? I do not know of a digital edition where the annotations of one editor dispute the annotations of another. Do we want to have such editions, do we need them?

(Now I see why you so often end your messages with questions.)

Best,

Neven

Neven Jovanovic, Zagreb


_______________________________________________
Unsubscribe at: http://dhhumanist.org/Restricted
List posts to: humanist@dhhumanist.org
List info and archives at: http://dhhumanist.org
Listmember interface at: http://dhhumanist.org/Restricted/
Subscribe at: http://dhhumanist.org/membership_form.php