A History of Large Language Models

(gregorygundersen.com)

93 points | by alexmolas 3 days ago

6 comments

  • jph00 3 hours ago
    This is quite a good overview, and parts of it reflect well how things played out in language model research. It's certainly true that language models and deep learning were not considered particularly promising in NLP, which frustrated me greatly at the time since I knew otherwise!

    However, the article misses the first two LLMs entirely.

    Radford cited CoVe, ELMo, and ULMFiT as the inspirations for GPT. ULMFiT (my paper with Sebastian Ruder) was the only one that actually fine-tuned the full language model for downstream tasks. https://thundergolfer.com/blog/the-first-llm

    ULMFiT also pioneered the 3-stage approach: pretraining a language model with a causal LM objective on a general corpus, fine-tuning that LM on the target corpus, and then fine-tuning it with a classification objective. Much later this was used in GPT-3.5 Instruct, and today it is used pretty much everywhere.
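
    Concretely, in fastai that pipeline looks roughly like the library's IMDB example (a minimal sketch with illustrative hyperparameters and names, not the exact ULMFiT training script; stage 1, the general-corpus pretraining on Wikitext-103, is baked into the pretrained AWD_LSTM weights fastai downloads):

        from fastai.text.all import *

        path = untar_data(URLs.IMDB)

        # Stage 2: fine-tune the pretrained causal LM on the target corpus.
        dls_lm = TextDataLoaders.from_folder(path, is_lm=True, valid_pct=0.1)
        learn_lm = language_model_learner(dls_lm, AWD_LSTM, metrics=Perplexity())
        learn_lm.fit_one_cycle(1, 2e-2)
        learn_lm.unfreeze()
        learn_lm.fit_one_cycle(1, 2e-3)
        learn_lm.save_encoder('finetuned_enc')   # keep the adapted encoder

        # Stage 3: fine-tune a classifier on top of the adapted encoder,
        # with gradual unfreezing and discriminative learning rates.
        dls_clas = TextDataLoaders.from_folder(path, valid='test',
                                               text_vocab=dls_lm.vocab)
        learn_clas = text_classifier_learner(dls_clas, AWD_LSTM, metrics=accuracy)
        learn_clas.load_encoder('finetuned_enc')
        learn_clas.fit_one_cycle(1, 2e-2)
        learn_clas.freeze_to(-2)
        learn_clas.fit_one_cycle(1, slice(1e-2 / (2.6 ** 4), 1e-2))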

    The other major oversight in the article is that Dai and Le (2015) is missing -- that pre-dated even ULMFiT in fine-tuning a language model for downstream tasks, but they missed the key insight that a general-purpose model pretrained on a large corpus was the critical first step.

    It's also missing a key piece of the puzzle regarding attention and transformers: the Memory Networks paper recently had its 10th birthday, and there's a nice writeup of its history here: https://x.com/tesatory/status/1911150652556026328?s=46

    It came out about the same time as the Neural Turing Machines paper (https://arxiv.org/abs/1410.5401), covering similar territory -- both pioneered the idea of combining attention and memory in ways later incorporated into transformers.

  • Al-Khwarizmi 2 hours ago
    A great writeup; just let me make two nitpicks (not to diminish the awesome effort of the author, but just in case they wish to take suggestions).

    1. I think the paper underemphasizes the relevance of BERT. While from today's LLM-centric perspective it may seem minor because it's in a different branch of the tech tree, it smashed multiple benchmarks at the time and made previous approaches to many NLP analysis tasks immediately obsolete. While I don't much like citation counts as a metric, a testament to its impact is that it has more than 145K citations - in the same order of magnitude as the Transformer paper (197K) and many more than GPT-1 (16K). GPT-1 would ultimately be a landmark paper due to what came afterwards, but at the time it wasn't that useful, being more oriented to generation (but not that good at it) and, IIRC, not really publicly available (it was technically open source but not posted in a repository or with a framework that let you actually run it). It's also worth remarking that for many non-generative NLP tasks (things like NER, parsing, sentence/document classification, etc.) the best alternative is often still a BERT-like model, even in 2025.

    2. The writing kind of implies that modern LLMs were something that was consciously sought after ("the transformer architecture was not enough. Researchers also needed advancements in how these models were trained in order to make the commodity LLMs most people interact with today"). The truth is that no one in the field expected modern LLMs. The story was more like OpenAI researchers noticing that GPT-2 was good at generating random text that looked fluent and thinking "if we make it bigger it will do that even better". But it turned out that not only did it generate better random text, it started being able to actually state real facts (in spite of the occasional hallucinations), answer questions, translate, be creative, etc. All those emergent abilities that are the basis of "commodity LLMs most people interact with today" were a totally unexpected development. In fact, it is still poorly understood why they work.

    • jph00 9 minutes ago
      (2) is not quite right. I created ULMFiT specifically because I thought a language model pretrained on a large general corpus then fine-tuned was the right way to go for creating generally capable NLP models. It wasn't an accident.

      The fact that, sometime later, GPT-2 could do zero-shot generation was indeed something a lot of folks got excited about, but that was actually not the correct path. The 3-step ULMFiT approach (causal LM training on a general corpus, then on a specialised corpus, then classification-task fine-tuning) was what GPT-3.5 Instruct used, which formed the basis of the first ChatGPT product.

      So although it took quite a while to take off, the idea of the LLM was quite intentional and has largely developed as I planned (even though at the time almost no one else felt the same way; luckily Alec Radford did! He told me in 2018 that reading the ULMFiT paper was a big "omg" moment for him, and he set to work on GPT right away).

      PS: On (1), if I may take a moment to highlight my team's recent work, we updated BERT last year to create ModernBERT, which showed that yes, this approach still has legs. Our models have had >1.5m downloads and there are >2k fine-tunes and variants of it now on Hugging Face: https://huggingface.co/models?search=modernbert

    • williamtrask 45 minutes ago
      Nit: regarding (2), Phil Blunsom did (the same Blunsom from the article, who was leading language modeling at DeepMind for about 7-8 years). He would often opine at Oxford (where he taught) that solving next-word prediction is a viable meta path to AGI. Almost nobody agreed at the time. He also called out early that scaling and better data were the key, and they did end up being exactly that, although Google wasn’t as “risk on” as OpenAI about gathering the data for GPT-1/2. Had they been, history could easily have been different. People forget the position OAI was in at the time. Elon/funding had left, key talent had left. Risk appetite was high for that kind of thing… and it paid off.
  • empiko 1 hour ago
    What a great write-up, kudos to the author! I’ve been in the field since 2014, so this really feels like reliving my career. I think one paradigm shift that isn’t fully represented in the article is what we now call “genAI.” Sure, we had all kinds of language models (BERTs, word embeddings, etc.), but in the end, most people used them to build customized classifiers or regression models. Nobody was thinking about “solving” tasks by asking oracle-like models questions in natural language. That was considered completely impossible with our technology even in 2018/19. Some people studied language models, but that definitely wasn’t their primary use case; they were mainly used to support tasks like speech-to-text, grammar correction, or similar applications.

    With GPT-3 and later ChatGPT, there was a very fundamental shift in how people think about approaching NLP problems. Many of the techniques and methods became outdated and you could suddenly do things that were not feasible before.

    • yobbo 1 hour ago
      > Nobody was thinking about “solving” tasks by asking oracle-like models

      I remember this being talked about maybe even earlier than 2018/2019, but the scale of models at the time was still off by at least one order of magnitude from what it needed to be for this to have a chance of working. It was the ridiculous scale of GPT that allowed the insight that scaling would make it useful.

      (Tangentially related; I remember a research project/system from maybe 2010 or earlier that could respond to natural language queries. One of the demos was to ask for distance between cities. It was based on some sort of language parsing and knowledge graph/database, not deep-learning. Would be interesting to read about this again, if anyone remembers.)

    • mike_hearn 13 minutes ago
      Are you sure? I wrote an essay at the end of 2016 about the state of AI research, and at the time researchers were demolishing benchmarks like FAIR's bAbI, which involved generating answers to questions. I wrote back then about story comprehension and programming robots by giving them stories (we'd now call these prompts).

      https://blog.plan99.net/the-science-of-westworld-ec624585e47

      bAbI paper: https://arxiv.org/abs/1502.05698

      Abstract: One long-term goal of machine learning research is to produce methods that are applicable to reasoning and natural language, in particular building an intelligent dialogue agent. To measure progress towards that goal, we argue for the usefulness of a set of proxy tasks that evaluate reading comprehension via question answering. Our tasks measure understanding in several ways: whether a system is able to answer questions via chaining facts, simple induction, deduction and many more. The tasks are designed to be prerequisites for any system that aims to be capable of conversing with a human.

      So at least FAIR was thinking about making AI that you could ask questions of in natural language. Then they went and beat their own benchmark with the Memory Networks paper:

      https://arxiv.org/pdf/1410.3916

      Fred went to the kitchen. Fred picked up the milk. Fred travelled to the office.

      Where is the milk ? A: office

      Where does milk come from ? A: milk come from cow

      What is a cow a type of ? A: cow be female of cattle

      Where are cattle found ? A: cattle farm become widespread in brazil

      What does milk taste like ? A: milk taste like milk

      What does milk go well with ? A: milk go with coffee

      Where was Fred before the office ? A: kitchen

      That was published in 2015. So we can see quite early ChatGPT-like capabilities, even though they were still quite primitive.

  • sreekanth850 1 hour ago
    I was wondering on what basis @Sama keeps saying they are near AGI, when in reality LLMs just calculate sequences and probabilities. I really doubt this bubble is going to burst soon.
  • WolfOliver 58 minutes ago
    with what tool was this article written?
  • brcmthrowaway 3 hours ago
    Dumb question, what is the difference between embedding and bag of words?
    • Al-Khwarizmi 2 hours ago
      With bag of words, the representation of a word is a vector whose dimension is the dictionary size: all components are zero except for the component corresponding to that word, which is one.

      This is not good for training neural networks (because they like to be fed dense, continuous data, not sparse and discrete), and it treats each word as an atomic entity without capturing relationships between words (you have no way to know that the words "plane" and "airplane" are more related than "plane" and "dog").

      With word embeddings, you get a space of continuous vectors with a predefined (lower) number of dimensions. This is more useful as input or training data for neural networks, and it is a representation of the meaning space ("plane" and "airplane" will have very similar vectors, while the one for "dog" will be different), which opens up a lot of possibilities for making models and systems more robust.
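
      A toy numpy sketch of the difference (the embedding values below are made up for illustration; in practice they are learned, e.g. by word2vec, GloVe, or a network's embedding layer):

          import numpy as np

          vocab = ["plane", "airplane", "dog"]   # toy 3-word dictionary

          # Bag-of-words / one-hot style: one dimension per word, a single 1.
          one_hot = {w: np.eye(len(vocab))[i] for i, w in enumerate(vocab)}
          # Every pair of distinct words looks equally unrelated: dot product is 0.
          print(one_hot["plane"] @ one_hot["airplane"])   # 0.0
          print(one_hot["plane"] @ one_hot["dog"])        # 0.0

          # Dense embeddings: low-dimensional continuous vectors.
          emb = {
              "plane":    np.array([0.90, 0.10, 0.05]),
              "airplane": np.array([0.85, 0.15, 0.05]),
              "dog":      np.array([0.05, 0.20, 0.90]),
          }

          def cosine(a, b):
              return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

          print(cosine(emb["plane"], emb["airplane"]))   # close to 1: similar meanings
          print(cosine(emb["plane"], emb["dog"]))        # much lower: unrelated meanings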