Timeline of large language model development

A large language model (LLM) is a computational model trained on a vast amount of data, designed for natural language processing tasks, especially language generation. The largest and most capable LLMs are generative pre-trained transformers (GPTs) that provide the core capabilities of modern chatbots. LLMs can be fine-tuned for specific tasks or guided by prompt engineering. These models acquire predictive power regarding syntax, semantics, and ontologies inherent in human language corpora, but they also inherit inaccuracies and biases present in the data they are trained on.

Early 1990s

IBM’s statistical machine translation models pioneered word-alignment techniques, helping establish corpus-based approaches that later influenced large-scale language modeling.

The number of publications about large language models by year grouped by publication types

2000

Researchers began using neural networks for language modeling, marking an early shift away from purely count-based n-gram approaches.

2001

Smoothed n-gram models (including Kneser–Ney smoothing) trained on roughly 300 million words achieved state-of-the-art perplexity on benchmark tests, illustrating the power of large corpora even before deep learning took over.
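
The kind of smoothed n-gram model described above can be illustrated with a toy sketch. The example below uses simple add-k smoothing rather than Kneser–Ney for brevity; the corpus and evaluation string are invented.

```python
import math
from collections import Counter

def train_bigram(tokens, k=1.0):
    """Add-k-smoothed bigram model (a simpler stand-in for Kneser-Ney)."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    vocab_size = len(unigrams)
    def prob(prev, word):
        # Smoothing ensures unseen bigrams never get zero probability
        return (bigrams[(prev, word)] + k) / (unigrams[prev] + k * vocab_size)
    return prob

def perplexity(prob, tokens):
    """Perplexity: 2 to the power of the average negative log2 probability."""
    n = len(tokens) - 1
    log_sum = sum(math.log2(prob(p, w)) for p, w in zip(tokens, tokens[1:]))
    return 2 ** (-log_sum / n)

corpus = "the cat sat on the mat the cat ran".split()
prob = train_bigram(corpus)
ppl = perplexity(prob, "the cat sat".split())  # lower is better
```

With hundreds of millions of training words instead of nine, the counts become dense enough that such models were state of the art at the time.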

2000s

With widespread internet access, researchers increasingly compiled massive web text datasets (“web as corpus”) to train larger statistical language models.

~2012

After deep neural networks succeeded in image classification, similar deep-learning architectures were adapted for language tasks, accelerating progress in neural language modeling.

2013

Word embeddings (for example, Word2Vec) became a key building block for neural NLP systems by representing words as dense vectors learned from large corpora.
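
The dense-vector idea can be sketched with a toy similarity check. The 4-dimensional vectors below are invented for illustration only; real Word2Vec embeddings have hundreds of dimensions and are learned from corpus co-occurrence statistics.

```python
import numpy as np

# Invented 4-dimensional "embeddings" (not actual Word2Vec output)
emb = {
    "king":  np.array([0.9, 0.8, 0.1, 0.0]),
    "queen": np.array([0.9, 0.1, 0.8, 0.0]),
    "apple": np.array([0.0, 0.1, 0.0, 0.9]),
}

def cosine(u, v):
    """Cosine similarity: the standard closeness measure for embeddings."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))
```

Semantically related words end up with higher cosine similarity than unrelated ones, which is what made embeddings a useful building block for downstream NLP systems.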

2014

The attention mechanism (Bahdanau et al.) was introduced for seq2seq models, enabling better handling of long-range dependencies and influencing later transformer designs.
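
A minimal sketch of additive (Bahdanau-style) attention follows; random matrices stand in for the learned parameters and encoder/decoder states of a trained seq2seq model.

```python
import numpy as np

rng = np.random.default_rng(0)
d, T = 8, 5   # hidden size and number of encoder positions

# Random tensors stand in for a trained encoder-decoder's states and weights
enc = rng.normal(size=(T, d))     # encoder hidden states h_1 .. h_T
dec = rng.normal(size=(d,))       # current decoder state s
W1 = rng.normal(size=(d, d))
W2 = rng.normal(size=(d, d))
v = rng.normal(size=(d,))

# Additive score: v^T tanh(W1 h_t + W2 s) for each encoder position t
scores = np.tanh(enc @ W1.T + dec @ W2.T) @ v

# Softmax turns scores into attention weights over the encoder positions
weights = np.exp(scores - scores.max())
weights /= weights.sum()

# Context vector: the weighted sum of encoder states the decoder conditions on
context = weights @ enc
```

Because the context is a weighted sum over all encoder positions, the decoder can reach back to distant tokens directly instead of relying on a single fixed-size bottleneck vector.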

2016

Google switched its translation service to neural machine translation, replacing statistical phrase-based translation with deep recurrent neural networks using LSTM encoder–decoder architectures.

2017

Google researchers introduced the transformer architecture in the NeurIPS paper “Attention Is All You Need,” establishing the core approach behind most modern large language models.

The training compute of notable large models in FLOPs vs publication date over the period 2010–2024. For overall notable models (top left), frontier models (top right), top language models (bottom left) and top models within leading companies (bottom right). The majority of these models are language models.

2017

Mixture-of-experts (MoE) architectures were introduced by Google researchers, using gated routing to activate only a small subset of model parameters for each input, greatly increasing model capacity without a proportional increase in compute per token.
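
The gated-routing idea can be sketched as a toy top-k MoE layer; the experts here are just random linear maps, and all names and sizes are invented for illustration.

```python
import numpy as np

def moe_forward(x, experts, gate_w, k=2):
    """Sparse MoE layer: a softmax gate picks the top-k experts for input x,
    and only those experts are evaluated (the source of the compute savings)."""
    logits = gate_w @ x                       # one routing logit per expert
    top = np.argsort(logits)[-k:]             # indices of the k best experts
    gates = np.exp(logits[top] - logits[top].max())
    gates /= gates.sum()                      # renormalize over selected experts
    return sum(g * experts[i](x) for g, i in zip(gates, top))

rng = np.random.default_rng(1)
d, n_experts = 4, 8
# Each "expert" is just a random linear map in this sketch
mats = [rng.normal(size=(d, d)) for _ in range(n_experts)]
experts = [(lambda W: (lambda x: W @ x))(W) for W in mats]
gate_w = rng.normal(size=(n_experts, d))
y = moe_forward(rng.normal(size=d), experts, gate_w)
```

With k = 2 of 8 experts active, only a quarter of the expert parameters touch any given input, which is how MoE models scale parameter counts faster than per-token compute.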

2018

BERT was introduced and rapidly became widely used in research and applications, popularizing transformer-based pretraining for language understanding tasks.

2018

GPT-1 (decoder-only) was introduced, helping establish the autoregressive transformer approach that later scaled into widely deployed LLMs.
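
The autoregressive loop at the heart of such models can be sketched with greedy decoding; the "model" below is a deterministic toy standing in for a transformer forward pass.

```python
import numpy as np

def greedy_generate(logits_fn, prompt_ids, max_new=5):
    """Autoregressive decoding: pick the most likely next token, append it,
    and feed the longer sequence back into the model; repeat."""
    ids = list(prompt_ids)
    for _ in range(max_new):
        ids.append(int(np.argmax(logits_fn(ids))))
    return ids

# Stand-in "model": deterministically prefers (last token + 1) mod vocab size,
# where a real LLM would run a transformer forward pass over the whole prefix
VOCAB = 10
def toy_logits(ids):
    logits = np.zeros(VOCAB)
    logits[(ids[-1] + 1) % VOCAB] = 1.0
    return logits

out = greedy_generate(toy_logits, [3])
```

Real systems replace the argmax with temperature or nucleus sampling, but the one-token-at-a-time structure is the same.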

2019

GPT-2 drew major attention after OpenAI initially withheld full release due to concerns about malicious use, intensifying public debate about LLM risks.

2019

Training-cost comparisons highlighted rapidly rising compute requirements: GPT-2 (1.5B parameters) was estimated to cost about $50,000 to train.

2020

GPT-3 was released and, as of 2025, remained available only via API (not as downloadable weights), reflecting a trend toward controlled deployment of frontier models.

2020

Few-shot prompting was demonstrated with GPT-3, showing that models could adapt to tasks from examples in the prompt without explicit fine-tuning.
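
Few-shot prompting is purely a matter of prompt construction, as the sketch below shows; the sentiment task, labels, and review texts are invented for illustration.

```python
def few_shot_prompt(examples, query):
    """Assemble an in-context prompt: labeled examples, then the new query.
    The model is expected to continue the pattern; no weights are updated."""
    blocks = [f"Review: {text}\nSentiment: {label}" for text, label in examples]
    blocks.append(f"Review: {query}\nSentiment:")
    return "\n\n".join(blocks)

prompt = few_shot_prompt(
    [("Great movie, loved it.", "positive"),
     ("Terrible acting, a waste of time.", "negative")],
    "A delightful surprise from start to finish.",
)
```

The model's continuation of the final "Sentiment:" line serves as its prediction, which is why this is called in-context learning rather than fine-tuning.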

2021

Megatron-Turing NLG 530B was estimated to cost around $11 million to train, illustrating the scale of infrastructure needed for frontier LLMs.

2022

PaLM (540B parameters) was estimated to cost about $8 million to train, further demonstrating escalating training costs at very large scales.

2022

ChatGPT’s consumer-facing release brought LLMs to broad public attention and drove rapid uptake across research areas such as robotics, software engineering, and societal impact work.

2022

OpenAI demonstrated InstructGPT, using instruction fine-tuning (and related methods such as RLHF) to make GPT-style models better at following user instructions.

2022

Prompt chaining and chain-of-thought prompting were introduced as methods to improve performance on multi-step problems by structuring intermediate reasoning steps.
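
A minimal sketch of the zero-shot chain-of-thought variant follows (the original chain-of-thought work instead placed worked step-by-step exemplars in the prompt); the question is invented.

```python
def chain_of_thought_prompt(question):
    """Zero-shot chain-of-thought: a trailing cue nudges the model to emit
    intermediate reasoning steps before committing to a final answer."""
    return f"Q: {question}\nA: Let's think step by step."

prompt = chain_of_thought_prompt(
    "A train travels 60 km in 40 minutes. What is its speed in km/h?"
)
```

The elicited intermediate steps can then be parsed or passed to a follow-up prompt, which is the basis of prompt chaining.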

2023

Research usage of BERT began to decline as decoder-only GPT-style models improved at solving tasks via prompting, shifting mainstream practice toward instruction- and chat-oriented LLMs.

2023

Open-weight models gained popularity, with prominent early examples including BLOOM and LLaMA (despite restrictions), broadening access to strong LLMs outside a few companies.

2023

A study found that prompting GPT-3.5 Turbo (the model then behind ChatGPT) to repeat a word many times could lead it to output excerpts from its training data, highlighting memorization risks.

2023

A comparison of LLM fact-checking performance found moderate proficiency, with GPT-4 reported as achieving the highest accuracy among tested models (71%), still behind human fact-checkers.

2024

Google introduced Gemini 1.5 with a context window reported to reach up to 1 million tokens, illustrating rapid expansion of workable context length for transformer models.

When each head calculates, according to its own criteria, how relevant the other tokens are to the token "it_", note that the second attention head, represented by the second column, focuses most on the first two rows, i.e. the tokens "The" and "animal", while the third column focuses most on the bottom two rows, i.e. on "tired", which has been tokenized into two tokens.

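
The per-head weights described in the caption can be computed as in the sketch below; random matrices stand in for learned projections, and the token count and head count are invented.

```python
import numpy as np

def attention_weights(X, Wq, Wk, n_heads):
    """Per-head attention weights softmax(Q K^T / sqrt(d_head)): each head
    scores token-to-token relevance by its own learned criteria."""
    T, d = X.shape
    dh = d // n_heads
    Q = (X @ Wq).reshape(T, n_heads, dh)
    K = (X @ Wk).reshape(T, n_heads, dh)
    scores = np.einsum("qhd,khd->hqk", Q, K) / np.sqrt(dh)
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)      # shape (heads, T, T)

rng = np.random.default_rng(2)
T, d, H = 6, 8, 2    # e.g. tokens of "The animal ... it_", two heads
X = rng.normal(size=(T, d))                        # token embeddings
A = attention_weights(X, rng.normal(size=(d, d)), rng.normal(size=(d, d)), H)
```

Each row of each head's weight matrix sums to 1, so a head's row for "it_" is exactly the kind of relevance distribution over the other tokens that the figure visualizes as columns.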

2023

GPT-4, released in March 2023, was widely praised for improved accuracy and multimodal capabilities; OpenAI did not disclose details of its architecture or parameter count.

The training compute of notable large AI models in FLOPs vs publication date over the period 2017–2024. The majority of large models are language models or multimodal models with language capacity.

2024

OpenAI released the reasoning model OpenAI o1, described as generating long chains of thought before returning a final answer, reflecting a shift toward model-native reasoning approaches.

Late 2024

“Reasoning models” emerged as a recognized approach to LLM development, training models to produce step-by-step analysis before final answers to improve results on complex tasks.

2024

Studies reported energy costs per prompt varying widely by task type, with image generation far more energy-intensive than typical text classification or generation.

According to research institute Epoch AI, energy consumption per typical ChatGPT query (0.3 watt-hours) is small compared to the average U.S. household consumption per minute (almost 20 watt-hours).

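
The comparison above can be sanity-checked with simple arithmetic; the 0.3 Wh per-query figure is the Epoch AI estimate quoted in the text, while the annual household consumption (~10,700 kWh) is an assumed, commonly cited U.S. average.

```python
# Per-query figure from the Epoch AI estimate quoted above; the annual
# household consumption (~10,700 kWh) is an assumption, not from the text.
query_wh = 0.3
household_wh_per_minute = 10_700 * 1000 / (365 * 24 * 60)   # roughly 20 Wh
queries_per_household_minute = household_wh_per_minute / query_wh
```

Under these assumptions a household consumes in one minute roughly what 60 to 70 typical queries do, consistent with the "almost 20 watt-hours" comparison.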

2024

Examples of non-transformer approaches (including recurrent variants and Mamba, a state-space model) were noted alongside continued transformer dominance in top-performing models.

2025 (Jan)

DeepSeek released DeepSeek-R1, a 671B-parameter open-weight reasoning model reported to perform comparably to OpenAI o1 at a lower price per token, increasing attention to open-weight high-performance models.

2025

Open-weight LLMs increasingly shaped the field from 2023 onward, with researchers arguing that meaningful openness should include inclusiveness, accountability, and ethical responsibility, not just released code or weights.

2025 (Apr)

OpenAI’s o3 followed o1 as another step in the “reasoning model” line, reflecting continued emphasis on step-by-step analysis for difficult tasks.

As of 2025

Prompt injection was highlighted as a significant risk when LLM-based agentic features have access to private data, motivating mitigations such as input sanitization and model auditing.

As of 2025

Legal and commercial disputes over training data and memorization accelerated, including major settlements and court decisions that depend on factual details of data acquisition, retention, and fair-use arguments.
