When thinking about #AI, it is worth recalling where language models came from. It explains a few things.
The first mention of a language model can be found in a 1983 paper by three IBM employees: Lalit Bahl, Frederick Jelinek and Robert Mercer. The paper sensibly suggests a statistical method to improve automatic speech recognition: whenever the audio system is too poor to distinguish between "*The cat sleep" and "The cat sleeps", the language model should come to the rescue and say that the grammatical sentence "The cat sleeps" is probabilistically much more likely.
By 1990, with the popularisation of personal computers, IBM is looking for opportunities to sell its mainframes. A research team (including Jelinek and Mercer) extends the statistical approach used in speech recognition to machine translation. The paper makes it very clear that the technology relies on sheer computing power -- something that IBM happens to have a lot of and is very keen to sell. It also makes for interesting close reading: after deploring the so-called 'impotence' of earlier computers, the authors go on to introduce mathematical measures such as the evocative 'fertility' to describe how many words the model spawns for each lexical item in the source text. (No women were involved in the making of this paper.)
Statistical language models are properly born, and with them comes the age of large text corpora. They prefigure the Large Language Models we are now used to.
Jelinek will become famous for his disdain of linguistics and is often quoted as saying “Every time I fire a linguist, the performance of the speech recognizer goes up.” Mercer will go on to donate millions of his personal fortune to the Brexit campaign, the 2016 election campaign of Donald Trump and a super PAC supporting J.D. Vance.
So should we really be surprised when scale is confused with intelligence? When Alex Karp says that AI will be bad for women? Or when prominent technology companies display fascistoid tendencies? It seems to me it was all there at the beginning.
References:
Bahl, L. R., Jelinek, F., & Mercer, R. L. (1983). A maximum likelihood approach to continuous speech recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-5(2), 179-190.
Brown, P. F., et al. (1990). A statistical approach to machine translation. Computational Linguistics, 16(2), 79-85.