Humanist Discussion Group, Vol. 36, No. 486.
Department of Digital Humanities, University of Cologne
Hosted by DH-Cologne
www.dhhumanist.org
Submit to: humanist@dhhumanist.org

    [1]    From: Maroussia Bednarkiewicz <maroussia.b@gmail.com>
           Subject: Re: [Humanist] 36.484: numbers for words: advantages? (105)

    [2]    From: Gabor Toth <gabor.toth@maximilianeum.de>
           Subject: Re: [Humanist] 36.484: numbers for words: advantages? (45)

    [3]    From: Gabor Toth <gabor.toth@maximilianeum.de>
           Subject: Re: [Humanist] 36.484: numbers for words: advantages? (10)

    [4]    From: Gioele Barabucci <gioele.barabucci@ntnu.no>
           Subject: Re: [Humanist] 36.484: numbers for words: advantages? (83)

    [5]    From: Jonah Lynch <jonahlynch@iimas.org>
           Subject: Re: [Humanist] 36.484: numbers for words: advantages? (31)


--[1]------------------------------------------------------------------------
        Date: 2023-03-27 09:05:01+00:00
        From: Maroussia Bednarkiewicz <maroussia.b@gmail.com>
        Subject: Re: [Humanist] 36.484: numbers for words: advantages?

What is being referred to here is called 'word embedding': in order to
computationally analyse a text, it is first 'embedded' into a
multi-dimensional vector space. The embedding can be initialised at the level
of the word, the character or a given unit (for instance the word 'humanist'
could be embedded as 'human' and '-ist'), and the latest techniques also
include context. The embedding can be learned and updated in an iterative
process to reflect deeper context. The embedding space is usually large (500+
dimensions), which makes it possible to capture a considerable amount of
nuance.

The embeddings or vectors thus obtained are stored in multi-dimensional
arrays called 'tensors'. In Natural Language Understanding (in particular
Deep Learning), these tensors are multiplied by other tensors called the
'weights' of the model. The model makes predictions using these tensor
multiplications in an iterative process in which, during the training phase,
each prediction is assessed against the true values. If the predictions are
inaccurate, the information is 'back-propagated' into the model, which
updates the weight tensors in order to make more accurate predictions. The
process is repeated until the predictions reach a high level of 'accuracy'.
Then come the test and validation phases, in which the model is tested and
validated. This technique makes it possible to formulate traditional
text-analysis questions as machine-learning prediction problems.

The very basic observation made initially is the following: words or units
(usually called features) used in similar 'contexts' end up embedded close to
each other in the embedding (vector) space. Hence the initial models were
able, for instance, to make the following prediction:

    man -> woman
    king -> ?
    ? = queen

The simplified equation would be:

    [embedding for king] - [embedding for man] + [embedding for woman]
    ≃ [embedding for queen]

More advanced prediction tasks, using more advanced versions of the equation
above, can predict text syntax, entities (whether a set of words represents a
person, an organisation, etc.), the sentiment associated with words or
expressions, and even generate whole texts. Because the embedding captures
features of words (particles, but also pixels, etc.) in a vector space, it
allows us to perform complex analytical tasks with the help of algorithms
(and hence computers).
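This analogy arithmetic can be reproduced in a few lines of Python, as a
sketch only, assuming the gensim library and its downloadable pretrained
GloVe vectors (neither of which is mentioned above; any pretrained word
embeddings would do):

    import gensim.downloader as api

    # Load small pretrained 50-dimensional word embeddings
    # (an illustrative choice, not the only possible one).
    vectors = api.load("glove-wiki-gigaword-50")

    # embedding(king) - embedding(man) + embedding(woman) ~ embedding(queen):
    # the nearest vector to the result of the arithmetic is looked up among
    # all word vectors.
    print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
    # typically something like [('queen', 0.85...)]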
To understand the latest progress in the field, one should look at how the
BERT model works: https://en.wikipedia.org/wiki/BERT_(language_model), and
the original paper: https://arxiv.org/pdf/1810.04805.pdf. And of course one
can explore the original basis for GPT, i.e. the transformer models,
explained here: https://machinelearningmastery.com/the-transformer-attention-mechanism/
and beautifully illustrated here: https://jalammar.github.io/illustrated-transformer/.
The original paper can be found here: https://arxiv.org/pdf/1706.03762.pdf.

I hope this helps to simplify what is quite a complex technique of data
representation. There is a lot on the topic out there; blog posts like this
one, https://machinelearningmastery.com/what-are-word-embeddings/, and many
others can help you delve deeper into it.

On Mon, 27 Mar 2023 at 08:54, Humanist <humanist@dhhumanist.org> wrote:

>        Humanist Discussion Group, Vol. 36, No. 484.
>        Department of Digital Humanities, University of Cologne
>        Hosted by DH-Cologne
>        www.dhhumanist.org
>        Submit to: humanist@dhhumanist.org
>
>
>        Date: 2023-03-26 14:25:06+00:00
>        From: Henry Schaffer <hes@ncsu.edu>
>        Subject: Using numbers for words?
>
> I was at a workshop about large scale computer processing with neural
> networks/AI and Natural Language Processing (NLP) came up briefly. The
> presenter mentioned that typically numbers were substituted for words - but
> didn't discuss why. She referred us to
> https://www.tensorflow.org/tutorials/text/word2vec as a method, and there's
> some more explanation at https://en.wikipedia.org/wiki/Word2vec
>
> I can see an advantage in storage and processing speed when dealing with a
> word represented as perhaps 2 bytes rather than using perhaps 10-20+ bytes
> per word, but I don't see any additional advantage. Do you?
>
> Representing a word as a vector allows more information to be kept (as in
> word2vec) and so that could give other advantages.
>
> Can anyone add more explanation/reasons?
>
> --henry

--[2]------------------------------------------------------------------------
        Date: 2023-03-27 07:42:10+00:00
        From: Gabor Toth <gabor.toth@maximilianeum.de>
        Subject: Re: [Humanist] 36.484: numbers for words: advantages?

Dear Henry,

A vector representation of a word expresses how that word is related to
other words in a given corpus. Thanks to the vector representation, the
relationships between words can be quantified. The classical example is
semantic similarity: you can calculate how similar the meanings of two words
are.

An illustrative and (highly!) simplified example: in the simplest (and less
useful) models, each 'dimension' of a word is another word. For instance,
"dog" and "bark" co-occur 10 times in the corpus, and "dog" and "eat"
co-occur 8 times. The vector representing "dog" will therefore have value 10
along the dimension "bark" and 8 along the dimension "eat".

Imagine that you also have the words "puppy" and "cat" in the same corpus:

a) "puppy" and "bark" co-occur 7 times, and "puppy" and "eat" co-occur 8
   times. The vector representing "puppy" will have value 7 along the
   dimension "bark" and 8 along the dimension "eat".

b) "cat" and "bark" co-occur 0 times, and "cat" and "eat" co-occur 6 times.
   The vector representing "cat" will have value 0 along the dimension
   "bark" and 6 along the dimension "eat".

Once you have the three words represented as vectors, you can calculate how
similar they are with a vector calculation (cosine similarity). "Puppy" and
"dog" will be closer than "cat" and "dog" because in the case of "cat" the
dimension "bark" is 0.
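In numbers, using exactly the toy counts above, the comparison can be checked
with a small sketch (the cosine function is written out by hand here for
clarity):

    import numpy as np

    # Toy co-occurrence vectors along the two dimensions ("bark", "eat").
    dog   = np.array([10.0, 8.0])
    puppy = np.array([7.0, 8.0])
    cat   = np.array([0.0, 6.0])

    def cosine(u, v):
        return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

    print(cosine(dog, puppy))  # ~0.98: "puppy" is very close to "dog"
    print(cosine(dog, cat))    # ~0.62: "cat" is further away, since it never "barks"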
b, "cat" and "bark" co-occur 0 times in the corpus; and "cat" and "eat" co-occur 6 times in a given corpus. The vector representing "cat" will have value 0 along the dimension "bark" and 6 along the dimension "eat". Once you have the three words represented as vectors, you can calculate how similar they are with vector calculation (cosine similarity); . "Puppy" and "dog" will be closer than "cat" and "dog" because in case of "cat" the dimension "bark" is 0. Again, this is a simplistic example (in reality we use metrics more complex than the number of times two words co-occur; the dimensions are not necessarily words themselves; we do not necessarily train word2vec model on a given corpus but rather use "ready-made" word embeddings, etc). The core idea is that the meaning of a word can be defined through the relationship of that word to other words in a given corpus (see https://aclweb.org/aclwiki/Distributional_Hypothesis). Vector space representation offers a mathematical formulation of this idea and then allows complex calculations that uncover the relationships between words. Hope this helps a bit, Gabor --[3]------------------------------------------------------------------------ Date: 2023-03-27 15:11:32+00:00 From: Gabor Toth <gabor.toth@maximilianeum.de> Subject: Re: [Humanist] 36.484: numbers for words: advantages? P.S.: For a more elaborate and more precise explanation, see the chapter on word semantics in our book: Barbara McGillivray & Gábor Mihály Tóth: Applying Language Technology in Humanities Research (Design, Application, and the Underlying Logic), https://link.springer.com/book/10.1007/978-3-030-46493-6 --[4]------------------------------------------------------------------------ Date: 2023-03-27 08:16:08+00:00 From: Gioele Barabucci <gioele.barabucci@ntnu.no> Subject: Re: [Humanist] 36.484: numbers for words: advantages? Dear Henry, a sincere thank you for this question. These (seemingly) "naive" questions really force us to reflect on the most basic, often unspoken, assumptions of our field. In the specific analysis performed by word2vec-style algorithms, the content of words, i.e. their letters, is not useful in the creation of the embedding space. What matter is only whether a word is present in a paragraph together with other words. So for the sake of simplifying the computation, it makes sense to keep track of arbitrary indexes rather than of whole words. With the specific case addressed, allow me to digress a bit on the topic of the numerical representation of concepts in computers. When processing data with a digital computer, the question is not _whether_ a word (or any other human concept) will be encoded with a number, but _with which_ number. Digital computers can only process digits, so everything has to be encoded (or represented) as a number. (Even numbers themselves [1,2].) Each numerical representation has its own pros and cons. Most of the time it comes down to balancing two aspects: introspection and calculation of distance. If I encode the words "happy" and "sad" using the ASCII encoding I get the (base 16) numbers 6861707079 and 736164. This representation is optimized for introspection. It allows me to quickly check whether a word contains the letter "a": does it's numerical representation contains the (base 16) number "61"? But suppose now that the question I'm after is instead "how distant is word X from word Y in lexicographical order"? Then this representation does not carry enough information to answer this question. 
In the case of word2vec or, more widely, in the case of multidimensional
embeddings, words (and concepts in general) are transformed into points in a
multidimensional space, i.e. vectors, with a peculiar property: their
relative position in that multidimensional space is a proxy for another
quality. For instance, in word2vec the nearer two vectors are to each other,
the more similar in meaning are the words they represent. (The definition of
"similar" is a whole other story and is specific to the word2vec variant
used.) Other qualities, such as which letters are in those words or how long
these words are, are deemed irrelevant, so they are not taken into account
when constructing the underlying embedding space. (See fastText for an
example of a different approach [4].)

I'd like to point out that the transformation of words and concepts into
specific kinds of numbers that allow us to better comprehend and compare some
of their qualities is not unique to the world of computers. For instance,
countries are routinely compared on the basis of some of their parameters,
like GDP and life expectancy [5] or percentage of obese people and daily
caloric intake [6]. Similarly to what happens in computer representations,
here the countries are reduced to a couple of numbers. Not random numbers,
but numbers whose comparison allows us to answer specific questions.

Regards,

[1] https://en.wikipedia.org/wiki/Numerical_tower
[2] https://en.wikipedia.org/wiki/Two%27s_complement
[3] nl /usr/share/dict/words | grep -E '\s(happy|sad)$'
[4] https://arxiv.org/abs/1607.04606
[5] https://www.cfr.org/sites/default/files/image/2022/09/graphic_og.png
[6] https://www.economicshelp.org/wp-content/uploads/2020/02/relationship-between-obesity-calories.png

--
Prof. Dr. Gioele Barabucci <gioele.barabucci@ntnu.no>
Associate Professor of Computer Science
NTNU — Norwegian University of Science and Technology

--[5]------------------------------------------------------------------------
        Date: 2023-03-27 07:40:43+00:00
        From: Jonah Lynch <jonahlynch@iimas.org>
        Subject: Re: [Humanist] 36.484: numbers for words: advantages?

Hi Henry,

in my work I make a lot of use of Word2Vec and similar methods. One reason it
is useful to "translate" words into numbers is that numbers can be
manipulated and visualized in ways that words resist. There are lots of
caveats to this affirmation, but for starters take a look at the word2vec
paper (https://arxiv.org/pdf/1301.3781.pdf), where you will find this phrase:
"vector("King") - vector("Man") + vector("Woman") results in a vector that is
closest to the vector representation of the word Queen [20]."

A word algebra, where semantic relations are to some extent preserved in
mathematical relations, is a powerful addition to the tools of linguistics
and philosophy.
Vector representations can be used to calculate semantic similarity between
words that do not share a root ("car" and "auto", for a simple instance);
they can be used to visualize complex data (see
https://pair-code.github.io/understanding-umap/), and much more besides.

The compression factor you mention actually works the other way around: a
word can be represented in English with a few bytes, whereas in vector form
it usually takes much more information. The OpenAI word embeddings (= vector
representations of text) that I currently use are 1536-dimensional vectors,
which are much more storage-intensive than the ASCII codes that represent the
letters in this email. But that extra room represents (some of) the
information that your brain contains regarding the words you are reading. By
having more information available in a computer-processable form, we can
automate some operations that heretofore could not be automated.

Hope that helps,
Jonah


_______________________________________________
Unsubscribe at: http://dhhumanist.org/Restricted
List posts to: humanist@dhhumanist.org
List info and archives at: http://dhhumanist.org
Listmember interface at: http://dhhumanist.org/Restricted/
Subscribe at: http://dhhumanist.org/membership_form.php