IBM’s statistical machine translation models pioneered word-alignment techniques, helping establish corpus-based approaches that later influenced large-scale language modeling.
Figure: Number of publications about large language models per year, grouped by publication type.
Researchers began using neural networks for language modeling, marking an early shift away from purely count-based n-gram approaches.
Smoothed n-gram models (including Kneser–Ney smoothing) trained on roughly 300 million words achieved state-of-the-art perplexity on benchmark tests, illustrating the power of large corpora even before deep learning took over.
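The flavor of such smoothing can be sketched for the bigram case. The toy corpus, discount value, and function name below are illustrative only; this is a minimal interpolated Kneser–Ney estimator, not the exact configuration of those benchmark systems:

```python
from collections import Counter

def kneser_ney_bigram(tokens, d=0.75):
    """Build an interpolated Kneser-Ney bigram model from a token list."""
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    history_counts = Counter(tokens[:-1])               # each token as a bigram history
    # Continuation count: number of distinct bigram types ending in w
    continuation = Counter(w for (_, w) in bigram_counts)
    # Number of distinct words observed to follow each history v
    followers = Counter(v for (v, _) in bigram_counts)
    total_types = len(bigram_counts)

    def prob(w, v):
        c_v = history_counts[v]
        if c_v == 0:
            # Unseen history: back off to the continuation distribution
            return continuation[w] / total_types
        discounted = max(bigram_counts[(v, w)] - d, 0) / c_v
        lam = d * followers[v] / c_v                    # interpolation weight
        return discounted + lam * continuation[w] / total_types

    return prob
```

The discounted mass taken from seen bigrams is redistributed via the continuation probability, so the estimates for a fixed history still sum to one over the vocabulary.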
With widespread internet access, researchers increasingly compiled massive web text datasets (“web as corpus”) to train larger statistical language models.
After deep neural networks succeeded in image classification, similar deep-learning architectures were adapted for language tasks, accelerating progress in neural language modeling.
Word embeddings (for example, Word2Vec) became a key building block for neural NLP systems by representing words as dense vectors learned from large corpora.
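Once words are dense vectors, similarity becomes simple geometry. A minimal sketch, using random vectors as stand-ins for learned Word2Vec embeddings (the dimensionality and word list are arbitrary):

```python
import numpy as np

def cosine_similarity(u, v):
    """Similarity of two dense word vectors, as typically used with embeddings."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy 50-dimensional vectors standing in for learned embeddings
rng = np.random.default_rng(0)
embeddings = {w: rng.normal(size=50) for w in ("king", "queen", "banana")}
```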
The attention mechanism (Bahdanau et al.) was introduced for seq2seq models, enabling better handling of long-range dependencies and influencing later transformer designs.
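Bahdanau-style attention scores each encoder state against the decoder query with a small feed-forward network, then softmaxes the scores into alignment weights. A minimal sketch; all weight shapes and names are assumed for illustration:

```python
import numpy as np

def additive_attention(query, keys, W_q, W_k, v):
    """Additive (Bahdanau-style) attention: score_i = v^T tanh(W_q q + W_k k_i),
    softmaxed over the keys to produce alignment weights."""
    scores = np.array([v @ np.tanh(W_q @ query + W_k @ k) for k in keys])
    weights = np.exp(scores - scores.max())     # numerically stable softmax
    weights /= weights.sum()
    return weights @ keys, weights              # context vector, alignments
```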
Google switched its translation service to neural machine translation, replacing statistical phrase-based translation with deep recurrent neural networks using LSTM encoder–decoder architectures.
Google researchers introduced the transformer architecture in the NeurIPS paper “Attention Is All You Need,” establishing the core approach behind most modern large language models.
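The transformer's core operation is scaled dot-product attention, Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V. A minimal NumPy sketch:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    # Numerically stable softmax over the key dimension
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V, w
```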
Figure: Training compute of notable large AI models in FLOPs vs. publication date, 2010–2024, for all notable models (top left), frontier models (top right), top language models (bottom left), and top models within leading companies (bottom right). Most of these models are language models.
Mixture-of-experts (MoE) architectures were introduced by Google researchers, using gated routing to activate only a subset of model parameters per input to reduce inference costs.
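The gating idea can be sketched as follows: compute one logit per expert, keep only the top-k, and mix those experts' outputs with renormalized softmax weights, so most experts never run for a given input. The gate matrix and expert functions below are toy stand-ins:

```python
import numpy as np

def moe_forward(x, gate_w, experts, k=2):
    """Sparse routing sketch: only the top-k experts by gate logit are
    evaluated, and their outputs are mixed with renormalized weights."""
    logits = x @ gate_w                       # one logit per expert
    top = np.argsort(logits)[-k:]             # indices of the k largest logits
    gate = np.exp(logits[top] - logits[top].max())
    gate /= gate.sum()                        # renormalize over selected experts
    return sum(g * experts[i](x) for g, i in zip(gate, top))
```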
BERT was introduced and rapidly became widely used in research and applications, popularizing transformer-based pretraining for language understanding tasks.
GPT-1 (decoder-only) was introduced, helping establish the autoregressive transformer approach that later scaled into widely deployed LLMs.
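Decoder-only models generate text autoregressively: score the current prefix, append the most likely next token, and repeat. A greedy-decoding sketch, with a hypothetical `next_token_logits` callable standing in for a real model:

```python
def greedy_decode(next_token_logits, prompt, max_new_tokens=5, eos=None):
    """Autoregressive loop: score the full prefix, append the argmax
    next token, and repeat until EOS or the length budget runs out."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        logits = next_token_logits(tokens)     # one score per vocabulary id
        nxt = max(range(len(logits)), key=logits.__getitem__)
        if nxt == eos:
            break
        tokens.append(nxt)
    return tokens
```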
GPT-2 drew major attention after OpenAI initially withheld full release due to concerns about malicious use, intensifying public debate about LLM risks.
Training-cost comparisons highlighted rapidly rising compute requirements: GPT-2 (1.5B parameters) was estimated to cost about $50,000 to train.
GPT-3 was released and, as of 2025, remained available only via API (not as downloadable weights), reflecting a trend toward controlled deployment of frontier models.
Few-shot prompting was demonstrated with GPT-3, showing that models could adapt to tasks from examples in the prompt without explicit fine-tuning.
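In few-shot prompting, the "training data" is simply packed into the prompt. A sketch of one common layout; the Input/Output template is illustrative, not a fixed GPT-3 format:

```python
def few_shot_prompt(examples, query):
    """Pack labeled examples into a single prompt so the model can infer
    the task from context alone, with no gradient updates."""
    blocks = [f"Input: {x}\nOutput: {y}" for x, y in examples]
    blocks.append(f"Input: {query}\nOutput:")   # model completes this line
    return "\n\n".join(blocks)
```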
Megatron-Turing NLG 530B was estimated to cost around $11 million to train, illustrating the scale of infrastructure needed for frontier LLMs.
PaLM (540B parameters) was estimated to cost about $8 million to train, further demonstrating escalating training costs at very large scales.
ChatGPT’s consumer-facing release brought LLMs to broad public attention and drove rapid uptake across research areas such as robotics, software engineering, and societal impact work.
OpenAI demonstrated InstructGPT, using instruction fine-tuning and reinforcement learning from human feedback (RLHF) to make GPT-style models better at following user instructions.
Prompt chaining and chain-of-thought prompting were introduced as methods to improve performance on multi-step problems by structuring intermediate reasoning steps.
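A chain-of-thought prompt typically prepends one or more worked exemplars so the model imitates the step-by-step pattern before committing to a final answer. A sketch with a made-up exemplar:

```python
# Hypothetical worked exemplar showing intermediate reasoning steps
COT_EXEMPLAR = (
    "Q: A farm has 3 pens with 4 sheep each. How many sheep are there?\n"
    "A: There are 3 pens and each holds 4 sheep, so 3 * 4 = 12. "
    "The answer is 12."
)

def cot_prompt(question):
    """Prepend the worked exemplar so the model reproduces the
    step-by-step reasoning pattern for the new question."""
    return f"{COT_EXEMPLAR}\n\nQ: {question}\nA:"
```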
Research usage of BERT began to decline as decoder-only GPT-style models improved at solving tasks via prompting, shifting mainstream practice toward instruction- and chat-oriented LLMs.
Open-weight models gained popularity, with prominent early examples including BLOOM and LLaMA (despite their license restrictions), broadening access to strong LLMs beyond a few companies.
A study found that prompting ChatGPT (GPT-3.5 Turbo) to repeat a word many times could cause it to output verbatim excerpts from its training data, highlighting memorization risks.
A comparison of LLM fact-checking performance found moderate proficiency, with GPT-4 reported as achieving the highest accuracy among tested models (71%), still behind human fact-checkers.
Google introduced Gemini 1.5 with a context window reported to reach up to 1 million tokens, illustrating rapid expansion of workable context length for transformer models.
Figure: Each attention head computes, according to its own criteria, how relevant the other tokens are to the token "it_". The second attention head (second column) attends most to the first two rows, i.e. the tokens "The" and "animal", while the third column attends most to the bottom two rows, i.e. to "tired", which has been tokenized into two tokens.[54]
GPT-4 was widely praised for improved accuracy and multimodal capabilities; OpenAI did not disclose its high-level architecture or parameter count.
Figure: Training compute of notable large AI models in FLOPs vs. publication date, 2017–2024. Most of these large models are language models or multimodal models with language capacity.
OpenAI released the reasoning model OpenAI o1, described as generating long chains of thought before returning a final answer, reflecting a shift toward model-native reasoning approaches.
“Reasoning models” emerged as a recognized approach to LLM development, training models to produce step-by-step analysis before final answers to improve results on complex tasks.
Studies summarized in the article reported energy costs per prompt varying widely by task type, with image generation far more energy-intensive than typical text classification or generation.
According to the research institute Epoch AI, the energy consumption of a typical ChatGPT query (about 0.3 watt-hours) is small compared with average U.S. household consumption per minute (almost 20 watt-hours).[209]
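The comparison is easy to check directly: at those figures, one query corresponds to under a second of average household consumption.

```python
query_wh = 0.3             # energy per typical ChatGPT query (Epoch AI figure)
household_wh_per_min = 20  # approx. average U.S. household consumption per minute

# Seconds of household consumption equivalent to one query
seconds_equivalent = query_wh / household_wh_per_min * 60
```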
Examples of non-transformer approaches (including recurrent variants and Mamba, a state-space model) were noted alongside continued transformer dominance in top-performing models.
DeepSeek released DeepSeek-R1, a 671B-parameter open-weight reasoning model reported to perform comparably to OpenAI o1 at a lower price per token, increasing attention to open-weight high-performance models.
The article notes that open-weight LLMs increasingly shaped the field since 2023, with research arguing for openness that includes inclusiveness, accountability, and ethical responsibility, not just code or weights.
OpenAI’s o3 followed o1 as another step in the “reasoning model” line, reflecting continued emphasis on step-by-step analysis for difficult tasks.
Prompt injection was highlighted as a significant risk when LLM-based agentic features have access to private data, motivating mitigations such as input sanitization and model auditing.
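A naive filtering mitigation can be sketched as below; the pattern list is purely illustrative and far from a real defense, since injected instructions can be rephrased arbitrarily:

```python
import re

def sanitize_untrusted(text):
    """Toy input-sanitization sketch: redact phrases that look like
    injected instructions before passing retrieved text to an LLM agent.
    (Illustrative only; real mitigations need far more than keyword filters.)"""
    patterns = [
        r"(?i)ignore (all )?previous instructions",
        r"(?i)system prompt",
    ]
    for p in patterns:
        text = re.sub(p, "[removed]", text)
    return text
```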
Legal and commercial disputes over training data and memorization accelerated, including major settlements and court decisions that depend on factual details of data acquisition, retention, and fair-use arguments.