Humanist Discussion Group, Vol. 37, No. 76.
Department of Digital Humanities, University of Cologne
Hosted by DH-Cologne
www.dhhumanist.org
Submit to: humanist@dhhumanist.org

    [1]    From: Tim Smithers <tim.smithers@cantab.net>
           Subject: Re: [Humanist] 37.71: studies of algorithmic prejudice: surprised? (177)

    [2]    From: maurizio lana <maurizio.lana@uniupo.it>
           Subject: Re: [Humanist] 37.71: studies of algorithmic prejudice: surprised? (41)


--[1]------------------------------------------------------------------------
        Date: 2023-06-05 08:36:30+00:00
        From: Tim Smithers <tim.smithers@cantab.net>
     Subject: Re: [Humanist] 37.71: studies of algorithmic prejudice: surprised?

Dear Willard,

The word 'bias,' these days, at least, carries quite a lot of pejorative baggage, so I will use the word 'perspective' instead.

Every text written by a person, or persons, is, it seems to me, necessarily written from the perspective of the author(s), and it is, I submit, therefore not surprising that this perspective shows up somewhat when other people read what is written. Any pretence of non-perspective writing is just another perspective. There's no escape, though some authors may be better at hiding their perspective than others, and some authors may try harder to do so than others.

But could we humans build a machine that generates texts which, when read by humans, would be reasonably judged, by a sensible number of different people, to be perspective-free? I don't know. But given the way human languages work for humans, it's hard, I think, to see how this could be done. I suspect the more perspective-free some text is made to be, the less understandable it becomes to humans. [Has anybody ever investigated this in some way?]

What we can say, without bias, I think [ -:) ], is that ChatGPT, and other machines like it, definitely do not, and cannot, generate perspective-free texts: they cannot generate bias-free texts.

The reason for this, as I understand things, is to be found in the answer to the question: what is the model in a Large Language Model (LLM) machine a model of?

First, I want to be clear what I take a model to be: a model is something that can be used, well enough for some purpose, instead of, or in place of, the thing it is a [sufficiently good] model of. We build massive computational models of the Earth's climate system to be able to usefully investigate, and thereby better understand, how the Earth's climate works, and changes, for example. Using these climate models makes possible investigations that are not possible by just studying -- observing and measuring aspects of -- the real climate. [I'll take it this is an uncontroversial idea of what a model is.]

So, the model in an LLM machine, what is it a model of? It is a model of the gigantic collection of [human written] texts that it is said to be 'trained' with -- but which I prefer to say it is programmed with, since this is what I think is really going on when these kinds of machines are built: they are programmed with data; gigantic amounts of data, in this case.

The way ChatGPT, and other machines like it, generates text is by trying to complete the text some user inputs, and it does this text-completion job by trying to reconstruct THE text in the gigantic collection of texts that most probably (give or take a bit) -- based upon its statistical estimates of the probabilities involved -- is the text the user started to re-write.
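To make this idea of a statistical model of a collection of texts, used in a generative mode, a little more concrete, here is a toy sketch, in Python, of a bigram model 'programmed' with a tiny invented collection of three texts and then used to complete an input text. [It is, of course, nothing like the scale or the sophisticated workings of ChatGPT; the collection, and the code, are made up purely to show the shape of the operation.]

  from collections import Counter, defaultdict

  # A tiny, invented 'collection of texts' the toy model is programmed with.
  corpus = [
      "the cat sat on the mat",
      "the cat sat on the sofa",
      "the dog sat on the mat",
  ]

  # The 'model': counts of how often each word follows each other word
  # in the collection.
  follows = defaultdict(Counter)
  for text in corpus:
      words = text.split()
      for a, b in zip(words, words[1:]):
          follows[a][b] += 1

  def complete(prompt, max_words=4):
      # Use the model in a generative mode: repeatedly extend the input
      # text with the most probable next word. [A real LLM machine samples
      # from its estimated probabilities, rather than always taking the
      # single most probable continuation.]
      words = prompt.split()
      for _ in range(max_words):
          nxt = follows.get(words[-1])
          if not nxt:
              break
          words.append(nxt.most_common(1)[0][0])
      return " ".join(words)

  print(complete("the dog"))
  # -> something like "the dog sat on the cat": a probable-looking
  #    completion stitched from the collection, though not itself one
  #    of the three texts in it.

The toy has only word-pair counts where ChatGPT has an enormously more sophisticated statistical model, but the shape of the job -- complete the input text with what the collection makes most probable -- is, I think, the same.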
In other words, text generation here is done by finding, in the gigantic collection of texts, the text that is the most probable full completion of the input text.

But there is no way we can turn our gigantic collection of texts into a database in which we can then do efficient and accurate text look-up, to find the full text of the one the input text is the beginning of. So, rather than try to use this gigantic collection of texts directly, we instead use a model of it; a model built in such a way that we can use it in a generative mode, to regenerate, to a sufficient statistical approximation, the corresponding full text completion of the input text. [It's worth noting that, although very big, this model is massively smaller, in storage and needed computational crunching, than a database of our gigantic collection of texts would be, which is way beyond all practical means.]

Now, given the statistical nature of this model, just like for any other kind of statistical model used in a generative mode -- think of all those Hidden Markov Models you've built and used in a predictive mode, for example -- when it is used in a generative way, starting with some input text, it mostly does not replicate THE "true" corresponding text in the gigantic collection of texts; it builds what it calculates to be the statistically estimated most probable (give or take a bit) full text completion. And, given the way machines like ChatGPT are built, using the Attention mechanism first published by Vaswani et al. in 2017, "Attention Is All You Need," this probabilistic incremental reconstruction of the corresponding full text can, and does, end up as a rather different text, one that mixes or combines, in some way, bits and parts of other texts in the gigantic collection.

In this way, ChatGPT seems to generate original, never before seen text, and it's true, the texts ChatGPT generates are usually not in our gigantic collection of texts, but something probabilistically close to them will be, albeit with some big-ish differences in some places sometimes.

But we don't care about this. Nobody uses ChatGPT to try to find a good approximation to THE text in the gigantic collection that they have input the beginning of. [And ChatGPT is not set up to be used to do this.] And nobody, not even the builders of ChatGPT, worries about how well the model works as a model of the gigantic collection of texts. And, of course, it would be very hard work to do this verification and testing: impossible, really. Thus, if ChatGPT can be used to generate any useful texts, it is the result of a side effect of the statistical workings of an un-verified computational model. But, as a model of the gigantic collection of human written texts, it will always, and inevitably, reproduce any biases that are present in any way in the texts of our gigantic collection.

When you use ChatGPT, you may think you are giving it some words nobody has ever written before, but, given the model it uses, ChatGPT will treat your input as the beginning of a text in the gigantic collection, and attempt to give you back the most probable existing text that completes your input text, according to the statistical estimates it has, from 'training,' of the probabilities involved.

In writing all this, I have, as I'm sure is evident, omitted much important detail of the sophisticated workings of LLM machines like ChatGPT, and said nothing about all the Dark Art needed to actually build these machines. And, of course, this reflects my particular perspective on all this.
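A small postscript, to make the bias point concrete with the same sort of toy. Suppose the collection of texts the model is programmed with happens to mention nurses with 'she' more often than 'he', and engineers the other way around. [The little 'collection' here, and its skew, are entirely invented, purely for illustration.] Then the model's estimates of what is 'probable', and so its completions, carry that skew with them:

  from collections import Counter

  # An invented, deliberately skewed 'collection of texts'.
  corpus = [
      "the nurse said she was tired",
      "the nurse said she was busy",
      "the nurse said he was late",
      "the engineer said he was tired",
      "the engineer said he was busy",
      "the engineer said she was late",
  ]

  def pronoun_estimates(role):
      # The toy model's estimate of which pronoun most probably follows
      # "the <role> said", read straight off the counts in the texts.
      counts = Counter(t.split()[3] for t in corpus if t.split()[1] == role)
      total = sum(counts.values())
      return {p: round(c / total, 2) for p, c in counts.items()}

  print(pronoun_estimates("nurse"))     # {'she': 0.67, 'he': 0.33}
  print(pronoun_estimates("engineer"))  # {'he': 0.67, 'she': 0.33}

Skew in the collection of texts, skew in the statistical estimates, skew in the generated completions. Nothing in the machinery distinguishes a regularity of the language from a prejudice of the writers: to the model it is all just probabilities.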
I would, however, be more than happy to see other perspectives here on the workings of LLM machines like ChatGPT, ones that correct my perspective and understandings, in particular.

Here, you'll perhaps be relieved to read, I'll stop generating text -:)

Best regards,

Tim


> On 4 Jun 2023, at 07:49, Humanist <humanist@dhhumanist.org> wrote:
>
>
> Humanist Discussion Group, Vol. 37, No. 71.
> Department of Digital Humanities, University of Cologne
> Hosted by DH-Cologne
> www.dhhumanist.org
> Submit to: humanist@dhhumanist.org
>
>
>
>
> Date: 2023-06-03 05:34:43+00:00
> From: Willard McCarty <willard.mccarty@mccarty.org.uk>
> Subject: studies of algorithmic prejudice: surprised?
>
> Here's a serious follow-up question--with thanks to Tim Smithers, Robin
> Burke and others for the responses to my inquiry. Very helpful indeed.
> But looking at what I and others have written, I wonder why the
> detection and exposure of this (artificially unconscious) prejudice,
> however correct and thoroughly pursued, is so unsatisfying? By analogy
> to other, older sorts of crime, I wonder why the surprise that something
> built by homo sapiens sapiens turns out to bring with it, as it gets
> technically better and better, more and more of the imprint of its origins?
> And then I wonder about the drive to rigorous perfection and purity in
> the digital, frustrated like all those that have preceded it. What is to
> be learned from all this?
>
> That there is quite a role for the digital humanities to play?
>
> Other questions most welcome.
>
> Yours,
> WM
> --
> Willard McCarty,
> Professor emeritus, King's College London;
> Editor, Interdisciplinary Science Reviews; Humanist
> www.mccarty.org.uk


--[2]------------------------------------------------------------------------
        Date: 2023-06-05 09:50:21+00:00
        From: maurizio lana <maurizio.lana@uniupo.it>
     Subject: Re: [Humanist] 37.71: studies of algorithmic prejudice: surprised?

hi Willard,

you pose 3 questions:

1. why the detection and exposure of this (artificially unconscious) prejudice, however correct and thoroughly pursued, is so unsatisfying?
2. why the surprise that something built by homo sapiens sapiens turns out to bring with it, as it gets technically better and better, more and more of the imprint of its origins?
3. [why this] drive to rigorous perfection and purity in the digital, frustrated like all those that have preceded it?

which i would synthesize as: "nothing human has ever been perfect - here, nothing different; so why do we/they struggle for perfection and purity?"

on one side, what i feel is that today "the digital" has an intrinsic, expansive, unstoppable capacity to shape everything it comes into contact with; hence we need it to be fair, much more than we need the laws to be fair (!), because laws can be rewritten and modified (we see it every day) while what has been shaped by the digital does not revert to a previous state. but we need "the algorithmic" to be subjected to the law, before and more than to ethics. (a huge field of discussion, indeed)

on the other side, i read in your second point ("something built by ...") a matter of fact: traces of the creator are in the creation (of the builder in the building; of the producer in the product; ...). ok. but given that we are discussing the undesirable characteristics of this creation, it seems that you wonder about the presence of a sort of original sin in the digital realm, which cannot but copy the flaws already present in the physical world?
Maurizio

the public use of one's own reason must always be free
Immanuel Kant
------------------------------------------------------------------------
Maurizio Lana
Università del Piemonte Orientale
Dipartimento di Studi Umanistici
Piazza Roma 36 - 13100 Vercelli


_______________________________________________
Unsubscribe at: http://dhhumanist.org/Restricted
List posts to: humanist@dhhumanist.org
List info and archives at: http://dhhumanist.org
Listmember interface at: http://dhhumanist.org/Restricted/
Subscribe at: http://dhhumanist.org/membership_form.php