IBM’s statistical machine translation models pioneered word-alignment techniques, helping establish corpus-based approaches that later influenced large-scale language modeling.
Figure: Number of publications about large language models per year, grouped by publication type.
Researchers began using neural networks for language modeling, marking an early shift away from purely count-based n-gram approaches.
Smoothed n-gram models (including Kneser–Ney smoothing) trained on roughly 300 million words achieved state-of-the-art perplexity on benchmark tests, illustrating the power of large corpora even before deep learning took over.
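The flavor of such smoothing can be sketched for the bigram case. The toy corpus, discount value, and function name below are illustrative only; this is a minimal interpolated Kneser–Ney estimator, not the exact configuration of those benchmark systems:

```python
from collections import Counter

def kneser_ney_bigram(tokens, d=0.75):
    """Build an interpolated Kneser-Ney bigram model from a token list."""
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    history_counts = Counter(tokens[:-1])               # each token as a bigram history
    # Continuation count: number of distinct bigram types ending in w
    continuation = Counter(w for (_, w) in bigram_counts)
    # Number of distinct words observed to follow each history v
    followers = Counter(v for (v, _) in bigram_counts)
    total_types = len(bigram_counts)

    def prob(w, v):
        c_v = history_counts[v]
        if c_v == 0:
            # Unseen history: back off to the continuation distribution
            return continuation[w] / total_types
        discounted = max(bigram_counts[(v, w)] - d, 0) / c_v
        lam = d * followers[v] / c_v                    # interpolation weight
        return discounted + lam * continuation[w] / total_types

    return prob
```

The discounted mass taken from seen bigrams is redistributed via the continuation probability, so the estimates for a fixed history still sum to one over the vocabulary.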
With widespread internet access, researchers increasingly compiled massive web text datasets (“web as corpus”) to train larger statistical language models.
After deep neural networks succeeded in image classification, similar deep-learning architectures were adapted for language tasks, accelerating progress in neural language modeling.
Word embeddings (for example, Word2Vec) became a key building block for neural NLP systems by representing words as dense vectors learned from large corpora.
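Once words are dense vectors, similarity becomes simple geometry. A minimal sketch, using random vectors as stand-ins for learned Word2Vec embeddings (the dimensionality and word list are arbitrary):

```python
import numpy as np

def cosine_similarity(u, v):
    """Similarity of two dense word vectors, as typically used with embeddings."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy 50-dimensional vectors standing in for learned embeddings
rng = np.random.default_rng(0)
embeddings = {w: rng.normal(size=50) for w in ("king", "queen", "banana")}
```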
The attention mechanism (Bahdanau et al.) was introduced for seq2seq models, enabling better handling of long-range dependencies and influencing later transformer designs.
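Bahdanau-style attention scores each encoder state against the decoder query with a small feed-forward network, then softmaxes the scores into alignment weights. A minimal sketch; all weight shapes and names are assumed for illustration:

```python
import numpy as np

def additive_attention(query, keys, W_q, W_k, v):
    """Additive (Bahdanau-style) attention: score_i = v^T tanh(W_q q + W_k k_i),
    softmaxed over the keys to produce alignment weights."""
    scores = np.array([v @ np.tanh(W_q @ query + W_k @ k) for k in keys])
    weights = np.exp(scores - scores.max())     # numerically stable softmax
    weights /= weights.sum()
    return weights @ keys, weights              # context vector, alignments
```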
Google switched its translation service to neural machine translation, replacing statistical phrase-based translation with deep recurrent neural networks using LSTM encoder–decoder architectures.
Google researchers introduced the transformer architecture in the NeurIPS paper “Attention Is All You Need,” establishing the core approach behind most modern large language models.
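The transformer's core operation is scaled dot-product attention, Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V. A minimal NumPy sketch:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    # Numerically stable softmax over the key dimension
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V, w
```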
Figure: Training compute of notable large AI models in FLOPs vs. publication date, 2010–2024, for all notable models (top left), frontier models (top right), top language models (bottom left), and top models within leading companies (bottom right). Most of these models are language models.
Mixture-of-experts (MoE) architectures were introduced by Google researchers, using gated routing to activate only a subset of model parameters per input to reduce inference costs.
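The gating idea can be sketched as follows: compute one logit per expert, keep only the top-k, and mix those experts' outputs with renormalized softmax weights, so most experts never run for a given input. The gate matrix and expert functions below are toy stand-ins:

```python
import numpy as np

def moe_forward(x, gate_w, experts, k=2):
    """Sparse routing sketch: only the top-k experts by gate logit are
    evaluated, and their outputs are mixed with renormalized weights."""
    logits = x @ gate_w                       # one logit per expert
    top = np.argsort(logits)[-k:]             # indices of the k largest logits
    gate = np.exp(logits[top] - logits[top].max())
    gate /= gate.sum()                        # renormalize over selected experts
    return sum(g * experts[i](x) for g, i in zip(gate, top))
```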
BERT was introduced and rapidly became widely used in research and applications, popularizing transformer-based pretraining for language understanding tasks.
GPT-1 (decoder-only) was introduced, helping establish the autoregressive transformer approach that later scaled into widely deployed LLMs.
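Decoder-only models generate text autoregressively: score the current prefix, append the most likely next token, and repeat. A greedy-decoding sketch, with a hypothetical `next_token_logits` callable standing in for a real model:

```python
def greedy_decode(next_token_logits, prompt, max_new_tokens=5, eos=None):
    """Autoregressive loop: score the full prefix, append the argmax
    next token, and repeat until EOS or the length budget runs out."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        logits = next_token_logits(tokens)     # one score per vocabulary id
        nxt = max(range(len(logits)), key=logits.__getitem__)
        if nxt == eos:
            break
        tokens.append(nxt)
    return tokens
```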
GPT-2 drew major attention after OpenAI initially withheld full release due to concerns about malicious use, intensifying public debate about LLM risks.
Training-cost comparisons highlighted rapidly rising compute requirements: GPT-2 (1.5B parameters) was estimated to cost about $50,000 to train.
GPT-3 was released and, as of 2025, remained available only via API (not as downloadable weights), reflecting a trend toward controlled deployment of frontier models.
Few-shot prompting was demonstrated with GPT-3, showing that models could adapt to tasks from examples in the prompt without explicit fine-tuning.
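In few-shot prompting, the "training data" is simply packed into the prompt. A sketch of one common layout; the Input/Output template is illustrative, not a fixed GPT-3 format:

```python
def few_shot_prompt(examples, query):
    """Pack labeled examples into a single prompt so the model can infer
    the task from context alone, with no gradient updates."""
    blocks = [f"Input: {x}\nOutput: {y}" for x, y in examples]
    blocks.append(f"Input: {query}\nOutput:")   # model completes this line
    return "\n\n".join(blocks)
```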
Megatron-Turing NLG 530B was estimated to cost around $11 million to train, illustrating the scale of infrastructure needed for frontier LLMs.
PaLM (540B parameters) was estimated to cost about $8 million to train, further demonstrating escalating training costs at very large scales.
ChatGPT’s consumer-facing release brought LLMs to broad public attention and drove rapid uptake across research areas such as robotics, software engineering, and societal impact work.
OpenAI demonstrated InstructGPT, using instruction fine-tuning and reinforcement learning from human feedback (RLHF) to make GPT-style models better at following user instructions.
Prompt chaining and chain-of-thought prompting were introduced as methods to improve performance on multi-step problems by structuring intermediate reasoning steps.
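A chain-of-thought prompt typically prepends one or more worked exemplars so the model imitates the step-by-step pattern before committing to a final answer. A sketch with a made-up exemplar:

```python
# Hypothetical worked exemplar showing intermediate reasoning steps
COT_EXEMPLAR = (
    "Q: A farm has 3 pens with 4 sheep each. How many sheep are there?\n"
    "A: There are 3 pens and each holds 4 sheep, so 3 * 4 = 12. "
    "The answer is 12."
)

def cot_prompt(question):
    """Prepend the worked exemplar so the model reproduces the
    step-by-step reasoning pattern for the new question."""
    return f"{COT_EXEMPLAR}\n\nQ: {question}\nA:"
```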
Research usage of BERT began to decline as decoder-only GPT-style models improved at solving tasks via prompting, shifting mainstream practice toward instruction- and chat-oriented LLMs.
Open-weight models gained popularity, with prominent early examples including BLOOM and LLaMA (despite their license restrictions), broadening access to strong LLMs beyond a few companies.
A study found that prompting ChatGPT (GPT-3.5 Turbo) to repeat a word many times could cause it to output verbatim excerpts from its training data, highlighting memorization risks.
A comparison of LLM fact-checking performance found moderate proficiency, with GPT-4 reported as achieving the highest accuracy among tested models (71%), still behind human fact-checkers.
Google introduced Gemini 1.5 with a context window reported to reach up to 1 million tokens, illustrating rapid expansion of workable context length for transformer models.
Figure: Each attention head computes, according to its own criteria, how relevant the other tokens are to the token "it_". The second attention head (second column) attends most to the first two rows, i.e. the tokens "The" and "animal", while the third column attends most to the bottom two rows, i.e. to "tired", which has been tokenized into two tokens.[54]
GPT-4 was widely praised for improved accuracy and multimodal capabilities; OpenAI did not disclose its high-level architecture or parameter count.
Figure: Training compute of notable large AI models in FLOPs vs. publication date, 2017–2024. Most of these large models are language models or multimodal models with language capacity.
OpenAI released the reasoning model OpenAI o1, described as generating long chains of thought before returning a final answer, reflecting a shift toward model-native reasoning approaches.
“Reasoning models” emerged as a recognized approach to LLM development, training models to produce step-by-step analysis before final answers to improve results on complex tasks.
Studies summarized in the article reported energy costs per prompt varying widely by task type, with image generation far more energy-intensive than typical text classification or generation.
According to the research institute Epoch AI, the energy consumption of a typical ChatGPT query (about 0.3 watt-hours) is small compared with average U.S. household consumption per minute (almost 20 watt-hours).[209]
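The comparison is easy to check directly: at those figures, one query corresponds to under a second of average household consumption.

```python
query_wh = 0.3             # energy per typical ChatGPT query (Epoch AI figure)
household_wh_per_min = 20  # approx. average U.S. household consumption per minute

# Seconds of household consumption equivalent to one query
seconds_equivalent = query_wh / household_wh_per_min * 60
```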
Examples of non-transformer approaches (including recurrent variants and Mamba, a state-space model) were noted alongside continued transformer dominance in top-performing models.
DeepSeek released DeepSeek-R1, a 671B-parameter open-weight reasoning model reported to perform comparably to OpenAI o1 at a lower price per token, increasing attention to open-weight high-performance models.
The article notes that open-weight LLMs increasingly shaped the field since 2023, with research arguing for openness that includes inclusiveness, accountability, and ethical responsibility, not just code or weights.
OpenAI’s o3 followed o1 as another step in the “reasoning model” line, reflecting continued emphasis on step-by-step analysis for difficult tasks.
Prompt injection was highlighted as a significant risk when LLM-based agentic features have access to private data, motivating mitigations such as input sanitization and model auditing.
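A naive filtering mitigation can be sketched as below; the pattern list is purely illustrative and far from a real defense, since injected instructions can be rephrased arbitrarily:

```python
import re

def sanitize_untrusted(text):
    """Toy input-sanitization sketch: redact phrases that look like
    injected instructions before passing retrieved text to an LLM agent.
    (Illustrative only; real mitigations need far more than keyword filters.)"""
    patterns = [
        r"(?i)ignore (all )?previous instructions",
        r"(?i)system prompt",
    ]
    for p in patterns:
        text = re.sub(p, "[removed]", text)
    return text
```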
Legal and commercial disputes over training data and memorization accelerated, including major settlements and court decisions that depend on factual details of data acquisition, retention, and fair-use arguments.