Humanist Discussion Group, Vol. 36, No. 504.
Department of Digital Humanities, University of Cologne
Hosted by DH-Cologne
www.dhhumanist.org
Submit to: humanist@dhhumanist.org

[1] From: Michael Falk <michaelgfalk@gmail.com>
    Subject: Re: [Humanist] 36.496: numbers for words (60)

[2] From: Maroussia Bednarkiewicz <maroussia.b@gmail.com>
    Subject: Re: [Humanist] 36.496: numbers for words (44)

--[1]------------------------------------------------------------------------
Date: 2023-03-30 10:04:55+00:00
From: Michael Falk <michaelgfalk@gmail.com>
Subject: Re: [Humanist] 36.496: numbers for words

Great question re: polysemy. The short answer is "yes." The slightly longer answer is: there are different ways of representing polysemy computationally. I will just compare two examples: word vectors (e.g. Word2Vec or FastText) and LDA topic models.

In a word vector model, each word is represented by a vector of numbers. A typical model might assign a vector of fifty, 100 or 150 numbers to each word. The individual numbers in the vector have no human meaning. They are basically just an arbitrary set of numbers that represent what the computer has learned about how a particular word is used. If a word is used in several different senses in the corpus, then in principle this will be encoded in the set of numbers somehow.

How does a word vector model realise that a word is used in several senses? It depends on the training algorithm used, but a simple example is the "skip-gram" model. In such a model, the computer tries to learn what words appear on either side of a given word. Since a word will tend to have different "neighbours" when used in different senses, the computer should notice this and somehow encode the information in the vector of numbers for that word.

A topic model stores information about polysemy in a different way. A topic model also represents each word by a vector of numbers, but in this case each number represents the probability of the word being assigned to a given topic. For example, imagine that you train a topic model to find 2 topics in a given corpus (typically you would search for 10s or 100s of topics, but let's simplify the example). The question is: if a word is assigned to topic x, how likely is it to be word y?

Imagine that your corpus contained many texts about embroidery and many texts about cookery, and let's say that the computer correctly managed to distinguish embroidery discourse from cookery discourse. If a word were assigned to the embroidery topic, then the probability that the word is "appliqué" might be 0.005 (5 in every 1000 words in embroidery texts are the word "appliqué"). The probability of "appliqué" appearing in a cookery text is presumably lower. By contrast, you occasionally stitch food in cooking, so we might imagine that "stitch" could have a probability of 0.007 in the embroidery topic and 0.0007 in the cookery topic, for example. Likewise, if there were sport texts in the corpus, the model should be able to work out that the word "stitch" appears in sporting discourse, in the sense of "cramp" (is that an Australianism?).

I guess the example of "stitch" does raise the question of what counts as polysemy. You could say that "stitch" has the same meaning in both cases. But topic models are certainly capable of encoding polysemy of other kinds, as I hope the sport example can help you to imagine. Of course, if you mean polysemy in the original sense used by Dante, then I know of no model that can do that! There may be one, but it is far beyond my ken.
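To make the skip-gram description above concrete, here is a minimal sketch, assuming the gensim library; the four-sentence toy corpus and the parameter values are invented purely for illustration and are not part of the original discussion:

    # Train a small skip-gram Word2Vec model on a toy corpus (illustrative only).
    # Requires: pip install gensim
    from gensim.models import Word2Vec

    sentences = [
        ["she", "finished", "the", "quilt", "with", "a", "blanket", "stitch"],
        ["appliqué", "and", "stitch", "work", "decorate", "the", "hem"],
        ["stitch", "the", "stuffed", "chicken", "closed", "before", "roasting"],
        ["the", "runner", "slowed", "with", "a", "stitch", "in", "her", "side"],
    ]

    # sg=1 selects the skip-gram training algorithm; vector_size is the number
    # of dimensions in each word's vector (the "fifty, 100 or 150 numbers").
    model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, sg=1,
                     epochs=200)

    print(model.wv["stitch"][:5])           # first five numbers of the vector for "stitch"
    print(model.wv.most_similar("stitch"))  # neighbours learned from its contexts

On a corpus this small the numbers are of course meaningless; the point is only the shape of the workflow: one fixed vector per word, learned from the word's neighbours.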
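The topic-model side can be sketched in the same way, again assuming gensim; the two-topic toy corpus and the passes/random_state settings below are likewise assumptions for illustration. get_topic_terms returns, for each topic, the per-word probabilities that the 0.005 and 0.0007 figures above stand for:

    # Fit a tiny two-topic LDA model and inspect per-topic word probabilities.
    # Requires: pip install gensim
    from gensim.corpora import Dictionary
    from gensim.models import LdaModel

    texts = [
        ["appliqué", "stitch", "thread", "hem", "embroidery"],
        ["embroidery", "stitch", "needle", "appliqué", "fabric"],
        ["roast", "stitch", "chicken", "butter", "oven"],
        ["simmer", "stock", "butter", "oven", "roast"],
    ]

    dictionary = Dictionary(texts)
    corpus = [dictionary.doc2bow(text) for text in texts]

    lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2,
                   passes=100, random_state=1)

    # For each topic, print the probability of each word given that topic.
    for topic_id in range(2):
        terms = [(dictionary[word_id], round(float(prob), 4))
                 for word_id, prob in lda.get_topic_terms(topic_id, topn=5)]
        print(topic_id, terms)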
Michael

--
Michael Falk, PhD, FHEA
Postdoctoral Research Associate
Wikipedia and the Nation's Story | wikihistories.net
University of Technology Sydney
Sent from my mobile phone.

--[2]------------------------------------------------------------------------
Date: 2023-03-30 08:20:28+00:00
From: Maroussia Bednarkiewicz <maroussia.b@gmail.com>
Subject: Re: [Humanist] 36.496: numbers for words

Hi,

Thank you so much for this interesting discussion! To Michael's division of word representations into strings, words and n-grams, one should add a different division that prevails nowadays: the division between static embeddings and dynamic embeddings.

---Static Embeddings---
Word embeddings known as 'static embeddings', offered by models like Word2Vec (also GloVe, FastText), will not account for polysemy. They represent text units (whether strings, words, sub-words or characters) in a static way, without their contexts.

---Dynamic Embeddings---
The kind of more advanced embeddings I described, which represent a word/sub-word/etc. *and* its context, will recognise the different usages of the unit in question if enough examples of all the different contexts are provided to the model in the training dataset. Hence the embeddings, i.e. word-and-context representations, will represent as many 'fathers' as there are contexts in which the word appears (possibly more than six, i.e. there can be more embeddings for 'father' than its six entries in Merriam-Webster). This is why they are called dynamic. In a post-processing step one could, for example, gather all the instances of 'father' and perform a classification task to account for their different categories or contexts. (I prefer not to use 'n-gram' in the case of dynamic embeddings, because 'n-gram' can give the false impression of a fixed number [n] of characters/words/sub-words, which is not the case in the dynamic embeddings used by BERT or GPT.)

This blog post by Stanford AI does a very good job of explaining contextual embeddings and their advantages: https://ai.stanford.edu/blog/contextual/.

The distinction between static and dynamic embeddings is crucial today with the latest models: for example, Michael's point on character embeddings is valid only for static embeddings, as shown in this paper about CharBERT, a model using dynamic character embeddings which produced good results on different NLP tasks (question answering, sequence labeling, and text classification): https://arxiv.org/pdf/2011.01513.pdf.

I look forward to reading about Michael's experience with text representations and polysemy!

_______________________________________________
Unsubscribe at: http://dhhumanist.org/Restricted
List posts to: humanist@dhhumanist.org
List info and archives at: http://dhhumanist.org
Listmember interface at: http://dhhumanist.org/Restricted/
Subscribe at: http://dhhumanist.org/membership_form.php
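Following on from Maroussia's description of dynamic embeddings, a minimal sketch of how one might inspect them, assuming the Hugging Face transformers library, the bert-base-uncased model, and two example sentences invented for illustration; unlike a static embedding, the same word receives a different vector in each context:

    # Compare contextual embeddings of the same word in two different contexts.
    # Requires: pip install transformers torch
    import torch
    from transformers import AutoTokenizer, AutoModel

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModel.from_pretrained("bert-base-uncased")

    def embedding_of(word, sentence):
        """Return the contextual vector BERT assigns to `word` inside `sentence`."""
        inputs = tokenizer(sentence, return_tensors="pt")
        with torch.no_grad():
            hidden = model(**inputs).last_hidden_state[0]   # one vector per token
        tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
        return hidden[tokens.index(word)]

    a = embedding_of("father", "my father taught me to cook.")
    b = embedding_of("father", "bede is often called the father of english history.")

    # The two vectors differ because the contexts differ; a static model
    # would assign "father" a single vector in both sentences.
    print(torch.cosine_similarity(a, b, dim=0).item())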