Humanist Discussion Group, Vol. 36, No. 489.
Department of Digital Humanities, University of Cologne
Hosted by DH-Cologne www.dhhumanist.org
Submit to: humanist@dhhumanist.org

Date: 2023-03-29 03:44:11+00:00
From: Michael Falk <michaelgfalk@gmail.com>
Subject: Re: [Humanist] 36.484: numbers for words: advantages?

Hi Henry,

The reason numbers are substituted for words is that words are the unit of analysis. It’s worth remembering that words are in fact *always* represented by numbers in any statistical model. Statistical models analyse words using mathematics, and need to be able to add, subtract, multiply and divide them in order to produce an output.

Basically, we can distinguish three possible numerical representations for a word:

1. As a string (like you’re reading now) => one number per letter

In the simplest case, each letter is represented by a number. In ASCII, for example, each character maps to a number in the range 0-127, so ‘a’ is represented by the number 97. This roughly corresponds to 1 byte per letter. You are probably reading this in Unicode (most likely UTF-8), however, where each character is encoded in 1-4 bytes, allowing well over a million different characters to be represented.

2. As a word => one number per word

In most NLP applications, it makes more sense to train the model on the words in the data rather than the letters. The biggest reason for this is that words are the main unit of meaning, and the sequence of data the computer has to analyse is shorter. For example, ‘the cat sat on the mat’ contains 6 words but 22 characters (including spaces). The computer has to perform fewer calculations if it is examining only 6 words rather than 22 characters. In addition, in a word-encoding scheme, ‘cat’ and ‘sat’ are simply two different words. If you let the computer see ‘c-a-t’ and ‘s-a-t’, it has a harder learning task, because it won’t know in advance that ‘cat’ and ‘sat’ are totally different words: 2/3 of their letters are the same!

3. As an n-gram => one number per sub- or super-word unit

The final main way to encode words is as sub- or super-word units called n-grams. Facebook’s FastText, for example, is an alternative to Word2Vec which encodes each word according to the groups of letters it contains, so that e.g. ‘cat’ would be encoded as [‘cat’, ‘ca’, ‘at’] and ‘there’ would be encoded as [‘there’, ‘th’, ‘the’, ‘her’, ‘ere’, ‘ther’, ‘here’, etc.]. The reason for this design is that sub-word units often carry useful information, for example the ‘an’ prefix in German words like ‘Anwendung’ or ‘Anlass’, or the ‘ing’ suffix in English ‘running’, ‘hiking’, etc. Google Translate also uses sub-word units. N-grams can also be super-word units: for example, you might encode the sentence ‘the cat sat on the mat’ as a collection of bigrams: ‘the cat’, ‘cat sat’, ‘sat on’, ‘on the’, ‘the mat’.

When you encode text as either words or n-grams, you scan through the corpus and identify every unique word or n-gram, creating a dictionary. So, for instance, the word ‘cat’ might be word 1000, or the n-gram ‘cat sat’ might be n-gram 10001, in your model’s dictionary. (There is a short sketch of this in code in the postscript below.)

Cheers,

Michael
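
P.S. To make the word-to-number mapping concrete, here is a minimal Python sketch of the three encodings above. The sentence, the variable names and the n-gram sizes are purely illustrative; this is not code from Word2Vec, FastText or any other library.

# Toy example only: the corpus, names and n-gram sizes are made up for
# illustration, not taken from Word2Vec, FastText or any other library.

sentence = "the cat sat on the mat"
words = sentence.split()

# (1) Letter-level: every character is already a number under the hood.
print(ord("a"))                    # 97, its ASCII/Unicode code point

# (2) Word-level: assign each unique word an integer ID (a "dictionary").
vocab = {}
for word in words:
    if word not in vocab:
        vocab[word] = len(vocab)
print(vocab)                       # {'the': 0, 'cat': 1, 'sat': 2, 'on': 3, 'mat': 4}
print([vocab[w] for w in words])   # [0, 1, 2, 3, 0, 4]

# (3a) Sub-word n-grams: the whole word plus its 2- and 3-letter chunks.
def char_ngrams(word, n_min=2, n_max=3):
    grams = [word]
    for n in range(n_min, n_max + 1):
        for i in range(len(word) - n + 1):
            gram = word[i:i + n]
            if gram not in grams:
                grams.append(gram)
    return grams
print(char_ngrams("cat"))          # ['cat', 'ca', 'at']

# (3b) Super-word n-grams: word bigrams of the same sentence.
print([" ".join(words[i:i + 2]) for i in range(len(words) - 1)])
# ['the cat', 'cat sat', 'sat on', 'on the', 'the mat']

Real toolkits add further refinements (word-boundary markers, hashing, frequency cut-offs and so on), but the basic idea of replacing each unit of text with an integer ID is the same.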