             Date: Mon, 25 Sep 2000 06:52:31 +0100
             From: "Jim Marchand" <marchand@ux1.cso.uiuc.edu>
             Subject: Re: 14.0263 letter frequency in Latin?
    The question as to the frequency of letters in Latin is interesting
    and confronts us with a number of basic problems.  These may seem trivial,
    but I can assure you they are not.  First: What is a
    language and how can we delimit it?  Language is one of those words like
    _is_ which we glibly use, but scarcely ever define.  Secondly, what is
    Latin?  Just looking at Olmsted's Index to Language 26-30 (LSA 1955): Latin,
    Latin, Archaic; Latin, British; Latin, Church; Latin, Classical; Latin,
    Colloquial; Latin, Early; Latin,
    Hisperic; Latin, Imperial; Latin, Late; Latin, Low; Latin,
    Medieval; Latin, Neo; Latin, Old; Latin, Patristic; Latin,
    Pauline; Latin, Renaissance; Latin, Republican; Latin, Vulgar,
    etc., and I have not been careful to list them all.  Letters
    themselves offer numerous problems.  How about diphthongs, often
    spelled, e.g. ae, as ligatures?  The standard lists are in what we
    nowadays would call ASCII (restricted), so that German contains no
    umlauts, French no accents, etc.  And what is the purpose of the
    list?  There was at one time a great movement to discover the
    frequency of sounds in various languages, and George Zipf collected these in
    search of support for his law of least effort, etc.  In
    fact, a glib answer to the question might be: Look at G. K. Zipf,
    he must list them somewhere (for example: G. K. Zipf and F. M.
    Rogers, "Phonemes and variphones in four present-day Romance
    languages and classical Latin from the viewpoint of Dynamic
    Philology," Archives Néerlandaises de Phonétique Expérimentale 15
    (1939), 111-147).
        One might, for example, take any large corpus and count the
    letters (many `concordance' programs [e.g. TACT, available for ca.
    $50 from the Modern Language Association] will do this for you).
    Or, one might take one of the concordances (or several of the
    concordances available), some of which list as lagniappe the letter
    frequencies of the corpus they are working with.  This is not very
    `scientific', but will work well for sloppy work; after all, we all
    know that the sequence of the frequency of English letters is
    etaoinshrdlump, as Pogo assures us and Vanna White demonstrates each weekday.
    My own count of Latin, made by running a text (the Five Books of
    Moses, j and i, v and u distinguished; ligatures expanded) of the
    Vulgate through TACT, looks like this: e a i o t n l r s c m d p u
    v b g h f q z j x.  I have, naturally, left out y and k.
    The question may not have an answer.
    In the Humanist archives is a thread on etaoin shrdlu, which you
    could retrieve by searching shrdlu.
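    A minimal sketch in Python of the kind of count described above. It is
    not a reconstruction of what TACT does. Ligatures are expanded, j/i and
    v/u are kept distinct, and letters are ranked by frequency. The function
    name, the ligature table, and the sample text are illustrative
    assumptions.

```python
from collections import Counter

# Assumption: ligatures are expanded to their component letters before counting.
LIGATURES = {"æ": "ae", "œ": "oe", "Æ": "AE", "Œ": "OE"}


def letter_ranking(text):
    """Return the letters a-z that occur in text, most frequent first."""
    for ligature, expansion in LIGATURES.items():
        text = text.replace(ligature, expansion)
    # j/i and v/u are deliberately NOT merged here; only a-z are counted.
    counts = Counter(ch for ch in text.lower() if "a" <= ch <= "z")
    # Descending count; ties are broken alphabetically so the output is stable.
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [letter for letter, _ in ordered]


# Illustrative sample only; a real count needs a large corpus.
sample = "In principio creavit Deus cælum et terram"
print(" ".join(letter_ranking(sample)))
```

    On a text this short the ranking means little; the point is only that
    the decisions about ligatures and about j/i and v/u have to be made
    before any counting starts.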
             Date: Mon, 25 Sep 2000 06:53:23 +0100
             From: Anne Mahoney <mahoa@bu.edu>
             Subject: letter frequency in Latin
    In a note to be published this year in Classical Outlook, my colleague Jeff
    Rydberg-Cox and I address this question.  We counted the letters in the Perseus
    Latin corpus and found that the relative ranking of letters is not too different
    from that in English, except that 'i' and 'u' rank significantly higher
    than 'o' -- not surprising, given that they do double duty as consonants.
    The figures are as follows:
    letter  percent (rounded)
    e       9.3   (727,785 occurrences)
    i       8.9
    u       8.7
    a       6.8
    t       6.5
    s       6.0
    r       4.9
    n       4.9
    m       4.5
    o       4.4
    c       3.2
    l       2.5
    d       2.4
    p       2.2
    q       1.4
    b       1.1
    g       0.8
    f       0.8
    h       0.7
    x       0.3
    y       0.1
    k       0      (434 occurrences)
    w       0      (322)
    z       0      (307)
    At the time there was no 'j' in the Perseus texts (though 'j' does occur in
    some of our schoolboy commentaries).  The corpus is not consistent about
    'u' and 'v', since we've retained whatever was in the original print
    editions, so we
    simply counted all 'v' as 'u'.  We also did not attempt to weed out Roman
    numerals.
    The corpus we counted was about 7.8 million characters (letters, digits, and
    punctuation), from Plautus, Caesar (BG), Catullus, Cicero (orations and
    letters), Virgil, Horace (Odes), Livy (books 1-10), Ovid (Metamorphoses),
    Suetonius (Caesars), the Vulgate, and Servius's commentary on Virgil.  Because
    this corpus is so heterogeneous, a lot more work could be done on refining the
    We did not look at letter sequences at all, and I don't think I've ever seen
    anything on that subject for Latin.
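    For anyone wanting to repeat the count on another corpus, here is a
    hedged sketch in Python, not the code used for the figures above. It
    folds 'v' into 'u' and reports each letter as a percentage rounded to one
    decimal place. The function name and the letters_only switch are
    illustrative assumptions. The default denominator (every non-whitespace
    character) is also an assumption, suggested by the fact that the
    percentages above sum to roughly 80 rather than 100.

```python
from collections import Counter


def letter_percentages(text, letters_only=False):
    """Map each letter to its share of the text, in percent, rounded to 0.1.

    'v' is folded into 'u'.  By default the denominator is every
    non-whitespace character (an assumption); with letters_only=True
    it is the number of letters a-z.
    """
    folded = text.lower().replace("v", "u")
    counts = Counter(ch for ch in folded if "a" <= ch <= "z")
    if letters_only:
        total = sum(counts.values())
    else:
        total = sum(1 for ch in folded if not ch.isspace())
    if total == 0:
        return {}
    # most_common() yields letters in descending order of count.
    return {letter: round(100 * n / total, 1) for letter, n in counts.most_common()}


# Illustrative sample only; a real count needs a large corpus.
for letter, pct in letter_percentages("Gallia est omnis divisa in partes tres").items():
    print(f"{letter}\t{pct}")
```

    As a rough check on the table, 727,785 occurrences of 'e' out of about
    7.8 million characters is about 9.3 percent.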
    --Anne Mahoney
    Perseus Project

