Humanist Discussion Group, Vol. 36, No. 486.
Department of Digital Humanities, University of Cologne
Hosted by DH-Cologne
www.dhhumanist.org
Submit to: humanist@dhhumanist.org

    [1]    From: Maroussia Bednarkiewicz <maroussia.b@gmail.com>
           Subject: Re: [Humanist] 36.484: numbers for words: advantages? (105)

    [2]    From: Gabor Toth <gabor.toth@maximilianeum.de>
           Subject: Re: [Humanist] 36.484: numbers for words: advantages? (45)

    [3]    From: Gabor Toth <gabor.toth@maximilianeum.de>
           Subject: Re: [Humanist] 36.484: numbers for words: advantages? (10)

    [4]    From: Gioele Barabucci <gioele.barabucci@ntnu.no>
           Subject: Re: [Humanist] 36.484: numbers for words: advantages? (83)

    [5]    From: Jonah Lynch <jonahlynch@iimas.org>
           Subject: Re: [Humanist] 36.484: numbers for words: advantages? (31)


--[1]------------------------------------------------------------------------
        Date: 2023-03-27 09:05:01+00:00
        From: Maroussia Bednarkiewicz <maroussia.b@gmail.com>
        Subject: Re: [Humanist] 36.484: numbers for words: advantages?

What is being referred to here is called 'word embedding': in order to
computationally analyse a text, it is first 'embedded' into a
multi-dimensional vector space. The embedding can be initialised at the level
of the word, the character or a given unit (for instance the word 'humanist'
could be embedded as 'human' and '-ist'), and the latest techniques also
include context. The embedding can be learned and updated in an iterative
process to reflect deeper context. The embedding space is usually large (500+
dimensions), which makes it possible to capture a considerable amount of
nuance.

The embeddings or vectors thus obtained are stored in multi-dimensional
arrays called 'tensors'. In Natural Language Understanding (in particular
Deep Learning), these tensors are multiplied by other tensors called the
'weights' of the model. The model makes predictions using these tensor
multiplications in an iterative process in which, during the training phase,
each prediction is assessed against the true values. If the predictions are
inaccurate, the information is 'back-propagated' into the model, which
updates the weight tensors in order to make more accurate predictions. The
process is repeated until the predictions reach a high level of 'accuracy'.
Then come the test and validation phases, in which the model is tested and
validated. This technique makes it possible to formulate traditional
text-analysis questions as machine-learning prediction problems.

The very basic observation made initially is the following: words or units
(usually called features) used in similar 'contexts' end up embedded close to
each other in the embedding (vector) space. Hence the initial models were
able, for instance, to make the following prediction:

    man -> woman
    king -> ?
    ? = queen

The simplified equation would be:

    [embedding for king] - [embedding for man] + [embedding for woman]
    ≃ [embedding for queen]

More advanced prediction tasks, using more advanced versions of the equation
above, can predict text syntax, entities (whether a set of words represents a
person, an organisation, etc.), the sentiment associated with words or
expressions, and even generate whole texts. Because the embedding captures
features of words (particles, but also pixels, etc.) in a vector space, it
allows us to perform complex analytical tasks with the help of algorithms
(and hence computers).
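This analogy arithmetic can be reproduced in a few lines of Python, as a
sketch only, assuming the gensim library and its downloadable pretrained
GloVe vectors (neither of which is mentioned above; any pretrained word
embeddings would do):

    import gensim.downloader as api

    # Load small pretrained 50-dimensional word embeddings
    # (an illustrative choice, not the only possible one).
    vectors = api.load("glove-wiki-gigaword-50")

    # embedding(king) - embedding(man) + embedding(woman) ~ embedding(queen):
    # the nearest vector to the result of the arithmetic is looked up among
    # all word vectors.
    print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
    # typically something like [('queen', 0.85...)]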
To understand the latest progress in the field, one should look at how the
BERT model works: https://en.wikipedia.org/wiki/BERT_(language_model), and
the original paper: https://arxiv.org/pdf/1810.04805.pdf. And of course one
can explore the original basis for GPT, i.e. the transformer models,
explained here: https://machinelearningmastery.com/the-transformer-attention-mechanism/
and beautifully illustrated here: https://jalammar.github.io/illustrated-transformer/.
The original paper can be found here: https://arxiv.org/pdf/1706.03762.pdf.

I hope this helps to simplify what is quite a complex technique of data
representation. There is a lot on the topic out there; blog posts like this
one, https://machinelearningmastery.com/what-are-word-embeddings/, and many
others can help you delve deeper into it.

On Mon, 27 Mar 2023 at 08:54, Humanist <humanist@dhhumanist.org> wrote:

>        Humanist Discussion Group, Vol. 36, No. 484.
>        Department of Digital Humanities, University of Cologne
>        Hosted by DH-Cologne
>        www.dhhumanist.org
>        Submit to: humanist@dhhumanist.org
>
>
>        Date: 2023-03-26 14:25:06+00:00
>        From: Henry Schaffer <hes@ncsu.edu>
>        Subject: Using numbers for words?
>
> I was at a workshop about large scale computer processing with neural
> networks/AI and Natural Language Processing (NLP) came up briefly. The
> presenter mentioned that typically numbers were substituted for words - but
> didn't discuss why. She referred us to
> https://www.tensorflow.org/tutorials/text/word2vec as a method, and there's
> some more explanation at https://en.wikipedia.org/wiki/Word2vec
>
> I can see an advantage in storage and processing speed when dealing with a
> word represented as perhaps 2 bytes rather than using perhaps 10-20+ bytes
> per word, but I don't see any additional advantage. Do you?
>
> Representing a word as a vector allows more information to be kept (as in
> word2vec) and so that could give other advantages.
>
> Can anyone add more explanation/reasons?
>
> --henry

--[2]------------------------------------------------------------------------
        Date: 2023-03-27 07:42:10+00:00
        From: Gabor Toth <gabor.toth@maximilianeum.de>
        Subject: Re: [Humanist] 36.484: numbers for words: advantages?

Dear Henry,

A vector representation of a word expresses how that word is related to
other words in a given corpus. Thanks to the vector representation, the
relationships between words can be quantified. The classical example is
semantic similarity: you can calculate how similar the meanings of two words
are.

An illustrative and (highly!) simplified example: in the simplest (and less
useful) models, each 'dimension' of a word is another word. For instance,
"dog" and "bark" co-occur 10 times in the corpus, and "dog" and "eat"
co-occur 8 times. The vector representing "dog" will therefore have value 10
along the dimension "bark" and 8 along the dimension "eat".

Imagine that you also have the words "puppy" and "cat" in the same corpus:

a) "puppy" and "bark" co-occur 7 times, and "puppy" and "eat" co-occur 8
   times. The vector representing "puppy" will have value 7 along the
   dimension "bark" and 8 along the dimension "eat".

b) "cat" and "bark" co-occur 0 times, and "cat" and "eat" co-occur 6 times.
   The vector representing "cat" will have value 0 along the dimension
   "bark" and 6 along the dimension "eat".

Once you have the three words represented as vectors, you can calculate how
similar they are with a vector calculation (cosine similarity). "Puppy" and
"dog" will be closer than "cat" and "dog" because in the case of "cat" the
dimension "bark" is 0.
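In numbers, using exactly the toy counts above, the comparison can be checked
with a small sketch (the cosine function is written out by hand here for
clarity):

    import numpy as np

    # Toy co-occurrence vectors along the two dimensions ("bark", "eat").
    dog   = np.array([10.0, 8.0])
    puppy = np.array([7.0, 8.0])
    cat   = np.array([0.0, 6.0])

    def cosine(u, v):
        return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

    print(cosine(dog, puppy))  # ~0.98: "puppy" is very close to "dog"
    print(cosine(dog, cat))    # ~0.62: "cat" is further away, since it never "barks"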
b, "cat" and "bark" co-occur 0 times in the corpus; and "cat" and "eat" co-occur 6 times in a given corpus. The vector representing "cat" will have value 0 along the dimension "bark" and 6 along the dimension "eat". Once you have the three words represented as vectors, you can calculate how similar they are with vector calculation (cosine similarity); . "Puppy" and "dog" will be closer than "cat" and "dog" because in case of "cat" the dimension "bark" is 0. Again, this is a simplistic example (in reality we use metrics more complex than the number of times two words co-occur; the dimensions are not necessarily words themselves; we do not necessarily train word2vec model on a given corpus but rather use "ready-made" word embeddings, etc). The core idea is that the meaning of a word can be defined through the relationship of that word to other words in a given corpus (see https://aclweb.org/aclwiki/Distributional_Hypothesis). Vector space representation offers a mathematical formulation of this idea and then allows complex calculations that uncover the relationships between words. Hope this helps a bit, Gabor --[3]------------------------------------------------------------------------ Date: 2023-03-27 15:11:32+00:00 From: Gabor Toth <gabor.toth@maximilianeum.de> Subject: Re: [Humanist] 36.484: numbers for words: advantages? P.S.: For a more elaborate and more precise explanation, see the chapter on word semantics in our book: Barbara McGillivray & Gábor Mihály Tóth: Applying Language Technology in Humanities Research (Design, Application, and the Underlying Logic), https://link.springer.com/book/10.1007/978-3-030-46493-6 --[4]------------------------------------------------------------------------ Date: 2023-03-27 08:16:08+00:00 From: Gioele Barabucci <gioele.barabucci@ntnu.no> Subject: Re: [Humanist] 36.484: numbers for words: advantages? Dear Henry, a sincere thank you for this question. These (seemingly) "naive" questions really force us to reflect on the most basic, often unspoken, assumptions of our field. In the specific analysis performed by word2vec-style algorithms, the content of words, i.e. their letters, is not useful in the creation of the embedding space. What matter is only whether a word is present in a paragraph together with other words. So for the sake of simplifying the computation, it makes sense to keep track of arbitrary indexes rather than of whole words. With the specific case addressed, allow me to digress a bit on the topic of the numerical representation of concepts in computers. When processing data with a digital computer, the question is not _whether_ a word (or any other human concept) will be encoded with a number, but _with which_ number. Digital computers can only process digits, so everything has to be encoded (or represented) as a number. (Even numbers themselves [1,2].) Each numerical representation has its own pros and cons. Most of the time it comes down to balancing two aspects: introspection and calculation of distance. If I encode the words "happy" and "sad" using the ASCII encoding I get the (base 16) numbers 6861707079 and 736164. This representation is optimized for introspection. It allows me to quickly check whether a word contains the letter "a": does it's numerical representation contains the (base 16) number "61"? But suppose now that the question I'm after is instead "how distant is word X from word Y in lexicographical order"? Then this representation does not carry enough information to answer this question. 
In the case of word2vec or, more widely, in the case of multidimensional
embeddings, words (and concepts in general) are transformed into points in a
multidimensional space, i.e. vectors, with a peculiar property: their
relative position in that multidimensional space is a proxy for another
quality. For instance, in word2vec the nearer two vectors are to each other,
the more similar in meaning are the words they represent. (The definition of
"similar" is a whole other story and is specific to the word2vec variant
used.) Other qualities, such as which letters are in those words or how long
these words are, are deemed irrelevant, so they are not taken into account
when constructing the underlying embedding space. (See fastText for an
example of a different approach [4].)

I'd like to point out that the transformation of words and concepts into
specific kinds of numbers that allow us to better comprehend and compare some
of their qualities is not unique to the world of computers. For instance,
countries are routinely compared on the basis of some of their parameters,
like GDP and life expectancy [5] or percentage of obese people and daily
caloric intake [6]. Similarly to what happens in computer representations,
here the countries are reduced to a couple of numbers. Not random numbers,
but numbers whose comparison allows us to answer specific questions.

Regards,

[1] https://en.wikipedia.org/wiki/Numerical_tower
[2] https://en.wikipedia.org/wiki/Two%27s_complement
[3] nl /usr/share/dict/words | grep -E '\s(happy|sad)$'
[4] https://arxiv.org/abs/1607.04606
[5] https://www.cfr.org/sites/default/files/image/2022/09/graphic_og.png
[6] https://www.economicshelp.org/wp-content/uploads/2020/02/relationship-between-obesity-calories.png

--
Prof. Dr. Gioele Barabucci <gioele.barabucci@ntnu.no>
Associate Professor of Computer Science
NTNU — Norwegian University of Science and Technology

--[5]------------------------------------------------------------------------
        Date: 2023-03-27 07:40:43+00:00
        From: Jonah Lynch <jonahlynch@iimas.org>
        Subject: Re: [Humanist] 36.484: numbers for words: advantages?

Hi Henry,

in my work I make a lot of use of Word2Vec and similar methods. One reason it
is useful to "translate" words into numbers is that numbers can be
manipulated and visualized in ways that words resist. There are lots of
caveats to this affirmation, but for starters take a look at the word2vec
paper (https://arxiv.org/pdf/1301.3781.pdf), where you will find this phrase:
"vector("King") - vector("Man") + vector("Woman") results in a vector that is
closest to the vector representation of the word Queen [20]."

A word algebra, where semantic relations are to some extent preserved in
mathematical relations, is a powerful addition to the tools of linguistics
and philosophy.
Vector representations can be used to calculate semantic similarity between
words that do not share a root ("car" and "auto", for a simple instance);
they can be used to visualize complex data (see
https://pair-code.github.io/understanding-umap/), and much more besides.

The compression factor you mention actually works the other way around: a
word can be represented in English with a few bytes, whereas in vector form
it usually takes much more information. The OpenAI word embeddings (= vector
representations of text) that I currently use are 1536-dimensional vectors,
which are much more storage-intensive than the ASCII codes that represent the
letters in this email. But that extra room represents (some of) the
information that your brain contains regarding the words you are reading. By
having more information available in a computer-processable form, we can
automate some operations that heretofore could not be automated.

Hope that helps,
Jonah


_______________________________________________
Unsubscribe at: http://dhhumanist.org/Restricted
List posts to: humanist@dhhumanist.org
List info and archives at: http://dhhumanist.org
Listmember interface at: http://dhhumanist.org/Restricted/
Subscribe at: http://dhhumanist.org/membership_form.php