Humanist Discussion Group, Vol. 15, No. 16.
Centre for Computing in the Humanities, King's College London
 From: Patrick Durusau <email@example.com> (38)
Subject: Birthday Present: Concurrent Markup
 From: Geoffrey Rockwell <firstname.lastname@example.org> (9)
Subject: Happy Birthday
 From: Elisabeth Burr <email@example.com> (15)
Subject: Re: 15.012 birthday presents please!
 From: "Friedrich Michael Dimpel" <firstname.lastname@example.org> (83)
Subject: Statistical test procedures in quantitative stylistic analysis
Date: Fri, 11 May 2001 06:40:11 +0100
From: Patrick Durusau <email@example.com>
Subject: Birthday Present: Concurrent Markup
Willard McCarty wrote:
> May I
> suggest, then, that you send to Humanist a birthday present in the form of
> a question or statement of a problem concerning humanities computing that
> bothers you most? Some piece of mental grit that gives you tsores every
> time you go on your mental way. Wonderful gift for Humanist.
A rather practical problem is the continuing inability to record and
query concurrent hierarchies in texts of interest to humanists. Such
hierarchies abound, particularly after analysis is added to such
materials. Yet, there is only one commercial application (MarkIt, Sema,
http://be.sema.com/mtc/products/index.html) that supports concurrent
markup in SGML encoded documents and such markup is excluded by
definition from XML encoded materials.
Stand-off markup is a partial solution to this problem since one can
point into a text to impose varying hierarchies (separately) on the
materials but at best it is not an elegant solution. Not to mention that
it leaves one unable to query elements for their position within another
hierarchy in the text.
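To make the stand-off approach concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not a description of MarkIt or of any existing tool: the sample text, the two hierarchies (metrical lines and sentences), and the names `Span` and `overlapping` are all invented for this example. Each hierarchy is kept as a separate list of character-offset spans over the same unmodified text, and a cross-hierarchy question has to be answered by comparing offsets by hand.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """One stand-off annotation: a label plus character offsets [start, end)."""
    label: str
    start: int
    end: int


# Invented sample text; the markup lives outside it and only points into it.
TEXT = "In the beginning was the word. And the word was good."

first_stop = TEXT.index(".") + 1   # offset just past the first sentence
line_break = TEXT.index("word")    # hypothetical metrical line break

# Hierarchy 1 (hypothetical): metrical lines.
lines = [Span("l1", 0, line_break), Span("l2", line_break, len(TEXT))]
# Hierarchy 2 (hypothetical): sentences.
sentences = [Span("s1", 0, first_stop), Span("s2", first_stop + 1, len(TEXT))]


def overlapping(span, others):
    """Return the spans in `others` that share at least one character with `span`."""
    return [o for o in others if o.start < span.end and span.start < o.end]


for line in lines:
    labels = [s.label for s in overlapping(line, sentences)]
    print(line.label, "overlaps", labels)
```

In this sketch the second line straddles the sentence boundary, so `overlapping(lines[1], sentences)` finds both sentences. Nothing in either hierarchy records that relationship; it has to be recomputed from the offsets each time, which is the limitation noted above.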
While I support and applaud the success of XML and related technologies,
it is with an awareness that they do not address the fundamental need of
humanists (not to mention computing humanists) to deal with very complex
textual structures and hierarchies. I want neither to flatten complex
texts nor to abandon the benefits of XML.
Are there any projects implementing concurrent markup (or concurrent
markup-like features) that would be good sources of ideas for how to
implement such features in XML? (I mean projects that postdate, or are not
mentioned in, Sperberg-McQueen & Huitfeldt,
_Concurrent Document Hierarchies in MECS and SGML_, Literary and Linguistic
Computing, volume 14, number 1, pp. 29-42.)
--
Patrick Durusau
Director of Research and Development
Society of Biblical Literature
firstname.lastname@example.org
--------------------------------------------------------------------
Date: Fri, 11 May 2001 06:40:54 +0100
From: Geoffrey Rockwell <email@example.com>
Subject: Happy Birthday
Since we have a birth date for HUMANIST the list, I would like to ask what the birth date of humanities computing is? Is it when Father Busa started the Index Thomisticus project? Is it the first ACH conference? Is it the release of affordable personal computers? Is it the release of usable concordance software for a personal computer? Is it the first course in humanities computing, the first centre, or the first programme?
Once we have a date or dates we can celebrate more often.
Yours with best wishes to HUMANIST and its editor,
--------------------------------------------------------------------
Date: Fri, 11 May 2001 06:41:54 +0100
From: Elisabeth Burr <firstname.lastname@example.org>
Subject: Re: 15.012 birthday presents please!
I am sorry that I didn't send my congratulations. This state of affairs is symptomatic of having less and less time. Our field seems particularly problematic in this respect, full of the danger of trying to do too much: following developments while also trying to establish computing in the humanities as an academic field. How should humanists cope? How do others cope?
All my best wishes to this wonderful list and thank you, Willard, for keeping it going.
Elisabeth
Prof'in Dr. Elisabeth Burr
Universitaet Bremen / FB 10 - Romanistik
D - 28334 Bremen
email@example.com / Elisabeth.Burr@uni-duisburg.de
http://www.fb10.uni-bremen.de/homepages/Burr/
--------------------------------------------------------------------
Date: Fri, 11 May 2001 06:43:42 +0100
From: "Friedrich Michael Dimpel" <firstname.lastname@example.org>
Subject: Statistical test procedures in quantitative stylistic analysis
I am working on a doctoral thesis in German Medieval Studies and I am turning to you with a question concerning statistical test procedures in computer-aided quantitative stylistic analysis.
The aim of my study is to develop a body of programs designed to examine Medieval German epics or passages of them with respect to statistical differences. In the second part of my paper I want to demonstrate some applications of these programs. The overall aim - as in most projects in the area of literary criticism using quantitative stylistic analysis - is to find statistical evidence in addition to the arguments of scholarly criticism.
The programs cover a multitude of distinguishing features: simple quantitative data such as length of words or verses, frequencies of vowels and consonants, some stylistic devices which can be easily captured, function words, words and combinations of words which are particularly frequent, as well as some syntactical and metrical parameters.
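A few of the simpler features listed above can be sketched in Python as follows. This is only an illustration, not the programs described here: the function name `verse_features`, the vowel set, the punctuation handling, and the sample verse with its function-word list are all assumptions made for the example.

```python
VOWELS = set("aeiou")   # assumption: extend to match the orthography of the edition used
PUNCTUATION = ".,;:!?"  # assumption: punctuation to strip from word edges


def verse_features(verse, function_words):
    """Return simple counts for one verse (one line of text)."""
    words = [w.strip(PUNCTUATION) for w in verse.lower().split()]
    words = [w for w in words if w]  # drop tokens that were pure punctuation
    letters = [c for w in words for c in w if c.isalpha()]
    vowels = sum(1 for c in letters if c in VOWELS)
    return {
        "words": len(words),
        "mean_word_length": sum(len(w) for w in words) / len(words) if words else 0.0,
        "vowels": vowels,
        "consonants": len(letters) - vowels,
        "function_words": sum(1 for w in words if w in function_words),
    }


# Invented sample verse and a user-adaptable function-word list.
sample = verse_features("der ritter reit in den walt", {"der", "in", "den"})
print(sample)
```

Passing the function-word list as a parameter keeps it adaptable by the user rather than fixed in the program.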
I hope that my programs will contribute arguments for the following questions:
- In general: Are there significant differences between the texts examined?
- Are there variations within the work of one author with respect to his/her style, e.g. if there is a literary model that the author draws on for parts of his/her text?
- Can texts or passages of a text of one author be assigned to the same or different periods of his/her literary production?
- Can texts the authorship of which is uncertain be assigned to one or several authors?
For an investigation of the last two questions, several texts will certainly have to be examined for comparison.
The programs are not intended for my use alone. I intend to give them a structure and documentation that make it possible for any medievalist to apply them, even without any knowledge of programming languages. The user should be able to segment a given text, to adapt the lists of function words and to determine the scope of the intended analysis.
My question concerns the statistical test procedure which is used to determine if the differences found between two texts or samples which were compared are statistically significant or not.
Up to now I have been using the Wilcoxon-White-Test (also called the Mann-Whitney test) as a test of statistical significance. For this purpose, the program divides the texts to be examined into segments of 100 verses each. For each segment, the frequency of the respective stylistic feature is recorded, so that the segments can be ranked by that frequency.
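The procedure just described can be sketched in Python. This is a minimal illustration under stated assumptions, not the implementation used in the study: the function names `feature_per_segment`, `midranks` and `mann_whitney_u` are invented here, and the p-value uses the large-sample normal approximation without tie or continuity correction, which is rough when there are only a few segments.

```python
import math


def feature_per_segment(verses, count_feature, size=100):
    """Count a feature in each consecutive full block of `size` verses."""
    blocks = [verses[i:i + size] for i in range(0, len(verses) - size + 1, size)]
    return [sum(count_feature(v) for v in block) for block in blocks]


def midranks(values):
    """Rank values from 1 to n; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions are 0-based, ranks 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def mann_whitney_u(a, b):
    """Return (U, two-sided p) via the normal approximation (no corrections)."""
    n1, n2 = len(a), len(b)
    ranks = midranks(list(a) + list(b))
    u1 = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    u = min(u1, n1 * n2 - u1)
    mean = n1 * n2 / 2
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mean) / sd
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail of the normal curve
    return u, p


# Invented per-segment frequencies for two texts.
text_a = [1, 2, 3, 4, 5]
text_b = [6, 7, 8, 9, 10]
u, p = mann_whitney_u(text_a, text_b)
print(u, p)
```

With texts of at least 1000 verses each, this segmentation yields at least ten segments per text. A p-value below 0.05 would then be read as a significant difference for that feature.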
I chose this test since Adam Kilgarriff (among others) recommended it. ("Which words are particularly characteristic of a text? A survey of statistical approaches", http://www.itri.brighton.ac.uk/~Adam.Kilgarriff/publications.html#1996). I preferred the Wilcoxon-White-Test over the Log-Likelihood-Test, which is also recommended there, because I expect medium to high frequencies for the stylistic features I want to examine in the rather long texts or text passages (at least 1000 verses).
I have now been made a little unsure by an essay by David I. Holmes. In view of the many studies based on multivariate methods in the last few years, Holmes states: "Principal Component Analysis is a standard technique in multivariate statistical data analysis. [...] The trend towards usage of multivariate statistical methods is now so established in stylometry that it is unusual to find papers which do not use them." (The Evolution of Stylometry in Humanities Scholarship, LLC 13, 1998, pp. 113f.)
I am now unsure how efficient the Wilcoxon-White-Test is, and whether 'unusual' here means 'wrong' or 'anachronistic'. I should be extremely grateful for any ideas or suggestions on this topic.
On the one hand I want to apply an adequate test procedure; on the other hand I cannot claim to fully understand PCA. PCA would furthermore clash with my intention to make the programs accessible to a multitude of medievalist colleagues because, as far as I can see, some knowledge of statistics is required not only to implement the test procedure but also to evaluate its results. It seems to me that the Wilcoxon-White-Test is considerably easier to handle, requiring only a judgement as to whether two texts differ significantly with respect to a certain feature (that is, at the 5% significance level) or not.
I would be grateful for any comments.
Friedrich Michael Dimpel
Friedrich Michael Dimpel M.A.
Institut für Germanistik
Bismarckstr. 1, 91054 Erlangen
Tel./Fax: 09131-85 22186 (10:00-12:00)
email@example.com