Humanist Discussion Group, Vol. 36, No. 489.
Department of Digital Humanities, University of Cologne
Hosted by DH-Cologne www.dhhumanist.org
Submit to: humanist@dhhumanist.org

Date: 2023-03-29 03:44:11+00:00
From: Michael Falk <michaelgfalk@gmail.com>
Subject: Re: [Humanist] 36.484: numbers for words: advantages?

Hi Henry,

The reason numbers are substituted for words is that words are the unit of analysis. It’s worth remembering that words are in fact *always* represented by numbers in any statistical model. Statistical models analyse words using mathematics, and need to be able to add, subtract, multiply and divide them in order to produce an output.

Basically, we can distinguish three possible numerical representations for a word:

1. As a string (like you’re reading now) => one number per letter

In the simplest case, each letter is represented by a number. In ASCII, for example, each character maps to a number in the range 0-127, so ‘a’ is represented by the number 97. This roughly corresponds to 1 byte per letter. You are probably reading this in Unicode (most likely UTF-8), however, where each character is encoded in 1-4 bytes, allowing well over a million different characters to be represented.

2. As a word => one number per word

In most NLP applications, it makes more sense to train the model on the words in the data rather than the letters. The biggest reason for this is that words are the main unit of meaning, and the sequence of data the computer has to analyse is shorter. For example, ‘the cat sat on the mat’ contains 6 words but 22 characters (including spaces). The computer has to perform fewer calculations if it is examining only 6 words rather than 22 characters. In addition, in a word-encoding scheme, ‘cat’ and ‘sat’ are simply two different words. If you let the computer see ‘c-a-t’ and ‘s-a-t’, it has a harder learning task, because it won’t know in advance that ‘cat’ and ‘sat’ are totally different words: 2/3 of their letters are the same!

3. As an n-gram => one number per sub- or super-word unit

The final main way to encode words is as sub- or super-word units called n-grams. Facebook’s FastText, for example, is an alternative to Word2Vec which encodes each word according to the groups of letters it contains, so that e.g. ‘cat’ would be encoded as [‘cat’, ‘ca’, ‘at’] and ‘there’ would be encoded as [‘there’, ‘th’, ‘the’, ‘her’, ‘ere’, ‘ther’, ‘here’, etc.]. The reason for this design is that sub-word units often carry useful information, for example the ‘an’ prefix in German words like ‘Anwendung’ or ‘Anlass’, or the ‘ing’ suffix in English ‘running’, ‘hiking’, etc. Google Translate also uses sub-word units. N-grams can also be super-word units: for example, you might encode the sentence ‘the cat sat on the mat’ as a collection of bigrams: ‘the cat’, ‘cat sat’, ‘sat on’, ‘on the’, ‘the mat’.

When you encode text as either words or n-grams, you scan through the corpus and identify every unique word or n-gram, creating a dictionary. So, for instance, the word ‘cat’ might be word 1000, or the n-gram ‘cat sat’ might be n-gram 10001, in your model’s dictionary. (There is a short sketch of this in code in the postscript below.)

Cheers,

Michael
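
P.S. To make the word-to-number mapping concrete, here is a minimal Python sketch of the three encodings above. The sentence, the variable names and the n-gram sizes are purely illustrative; this is not code from Word2Vec, FastText or any other library.

# Toy example only: the corpus, names and n-gram sizes are made up for
# illustration, not taken from Word2Vec, FastText or any other library.

sentence = "the cat sat on the mat"
words = sentence.split()

# (1) Letter-level: every character is already a number under the hood.
print(ord("a"))                    # 97, its ASCII/Unicode code point

# (2) Word-level: assign each unique word an integer ID (a "dictionary").
vocab = {}
for word in words:
    if word not in vocab:
        vocab[word] = len(vocab)
print(vocab)                       # {'the': 0, 'cat': 1, 'sat': 2, 'on': 3, 'mat': 4}
print([vocab[w] for w in words])   # [0, 1, 2, 3, 0, 4]

# (3a) Sub-word n-grams: the whole word plus its 2- and 3-letter chunks.
def char_ngrams(word, n_min=2, n_max=3):
    grams = [word]
    for n in range(n_min, n_max + 1):
        for i in range(len(word) - n + 1):
            gram = word[i:i + n]
            if gram not in grams:
                grams.append(gram)
    return grams
print(char_ngrams("cat"))          # ['cat', 'ca', 'at']

# (3b) Super-word n-grams: word bigrams of the same sentence.
print([" ".join(words[i:i + 2]) for i in range(len(words) - 1)])
# ['the cat', 'cat sat', 'sat on', 'on the', 'the mat']

Real toolkits add further refinements (word-boundary markers, hashing, frequency cut-offs and so on), but the basic idea of replacing each unit of text with an integer ID is the same.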