List of large language models
From Wikipedia, the free encyclopedia
A large language model (LLM) is a type of machine learning model designed for natural language processing tasks such as language generation. LLMs are language models with many parameters, and are trained with self-supervised learning on a vast amount of text.
List
For the training cost column, 1 petaFLOP-day equals 1 petaFLOP/sec × 1 day, or 8.64×10^19 FLOP (floating point operations). Only the cost of the largest model is shown. The number of parameters is measured in billions,[a] and the training cost is measured in petaFLOP-days.
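The unit conversion used for the training cost column can be checked with a short Python sketch. The constant and function names below are illustrative assumptions, not taken from any cited source; the only fact used is the definition above (1 petaFLOP/sec sustained for one day).

```python
# Constants follow the definition in the text: 1 petaFLOP/sec sustained for 1 day.
PETAFLOP_PER_SEC = 10**15          # floating point operations per second
SECONDS_PER_DAY = 24 * 60 * 60     # 86,400 seconds
FLOP_PER_PETAFLOP_DAY = PETAFLOP_PER_SEC * SECONDS_PER_DAY  # 8.64e19 FLOP


def petaflop_days_to_flop(petaflop_days: float) -> float:
    """Convert a training cost in petaFLOP-days to total FLOP."""
    return petaflop_days * FLOP_PER_PETAFLOP_DAY


def flop_to_petaflop_days(flop: float) -> float:
    """Convert a total FLOP count to petaFLOP-days."""
    return flop / FLOP_PER_PETAFLOP_DAY


if __name__ == "__main__":
    # Hypothetical input: a table entry of exactly 1 petaFLOP-day.
    print(f"{petaflop_days_to_flop(1):.3e} FLOP")
```

The same helpers can be used to compare a table entry in petaFLOP-days against a training budget that a source reports directly in FLOP.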
2018
| Name | Release date[b] | Developer | Number of parameters | Corpus size | Training cost | License[c] | Notes |
|---|---|---|---|---|---|---|---|
| GPT-1 | Jun 11 | OpenAI | 0.117B | Unknown | 1[1] | MIT[2] | |
| BERT | Oct 2018 | Google | 0.340B[4] | 3.3B words[4] | 9[5] | Apache 2.0[6] | An early and influential language model.[7] Encoder-only and thus not built to be prompted or generative.[8] Training took 4 days on 64 TPUv2 chips.[4] |
2019
| Name | Release date[b] | Developer | Number of parameters | Corpus size | Training cost | License[c] | Notes |
|---|---|---|---|---|---|---|---|
| T5 | Oct 2019 | Google | 11B[9] | 34B tokens[9] | Unknown | Apache 2.0[10] | Base model for Google projects like Imagen.[11] |
| XLNet | Jun 2019 | Google | 0.340B[12] | 33B words | 330 | Apache 2.0[13] | An alternative to BERT; designed as encoder-only. Trained on 512 TPU v3 chips for 5.5 days.[14] |
| GPT-2 | Feb 2019 | OpenAI | 1.5B[15] | 40GB[16] (~10B tokens)[17] | 28[18] | MIT[19] | Trained on 32 TPUv3 chips for 1 week.[18] |
2020
2021
| Name | Release date[b] | Developer | Number of parameters | Corpus size | Training cost | License[c] | Notes |
|---|---|---|---|---|---|---|---|
| GPT-Neo | Mar 2021 | EleutherAI | 2.7B[23] | 825 GiB[24] | Unknown | MIT[25] | The first of a series of free GPT-3 alternatives released by EleutherAI. GPT-Neo outperformed an equivalent-size GPT-3 model on some benchmarks, but was significantly worse than the largest GPT-3.[25] |
| GPT-J | Jun 2021 | EleutherAI | 6B[26] | 825 GiB[24] | 200[27] | Apache 2.0 | |
| Megatron-Turing NLG | Oct 2021[28] | Microsoft and Nvidia | 530B[29] | 338.6B tokens[29] | 38000[30] | Unreleased | Trained for 3 months on over 2000 A100 GPUs on the NVIDIA Selene Supercomputer, for over 3 million GPU-hours.[30] |
| Ernie 3.0 Titan | Dec 2021 | Baidu | 260B[31] | 4TB | Unknown | Proprietary | |
| Claude[32] | Dec 2021 | Anthropic | 52B[33] | 400B tokens[33] | Unknown | Proprietary | Fine-tuned for desirable behavior in conversations.[34] |
| GLaM (Generalist Language Model) | Dec 2021 | Google | 1200B[35] | 1.6T tokens[35] | 5600[35] | Proprietary | |
| Gopher | Dec 2021 | Google DeepMind | 280B[36] | 300B tokens[37] | 5833[38] | Proprietary | |
2022
| Name | Release date[b] | Developer | Number of parameters | Corpus size | Training cost | License[c] | Notes |
|---|---|---|---|---|---|---|---|
| LaMDA (Language Models for Dialog Applications) | Jan 2022 | Google | 137B[39] | 1.56T words,[39] 168B tokens[37] | 4110[40] | Proprietary | |
| GPT-NeoX | Feb 2022 | EleutherAI | 20B[41] | 825 GiB[24] | 740[27] | Apache 2.0 | |
| Chinchilla | Mar 2022 | Google DeepMind | 70B[42] | 1.4T tokens[42][37] | 6805[38] | Proprietary | |
| PaLM (Pathways Language Model) | Apr 2022 | Google | 540B[43] | 768B tokens[42] | 29,250[38] | Proprietary | |
| OPT (Open Pretrained Transformer) | May 2022 | Meta | 175B[44] | 180B tokens[45] | 310[27] | Non-commercial research[d] | GPT-3 architecture with some adaptations from Megatron. The training logbook written by the team was published.[46] |
| YaLM 100B | Jun 2022 | Yandex | 100B[47] | 1.7TB[47] | Unknown | Apache 2.0 | |
| Minerva | Jun 2022 | Google | 540B[48] | 38.5B tokens from webpages filtered for math content and from arXiv[48] | Unknown | Proprietary | For solving "mathematical and scientific questions using step-by-step reasoning".[49] |
| BLOOM | Jul 2022 | Large collaboration led by Hugging Face | 175B[50] | 350B tokens (1.6TB)[51] | Unknown | Responsible AI | |
| Galactica | Nov 2022 | Meta | 120B | 106B tokens[52] | Unknown | CC-BY-NC-4.0 | |
| AlexaTM (Teacher Models) | Nov 2022 | Amazon | 20B[53] | 1.3T[54] | Unknown | Proprietary[55] | |
2023
| Name | Release date[b] | Developer | Number of parameters | Corpus size | Training cost | License[c] | Notes |
|---|---|---|---|---|---|---|---|
| Llama | Feb 2023 | Meta AI | 65B[56] | 1.4T[56] | 6300[57] | Non-commercial research[e] | |
| GPT-4 | Mar 2023 | OpenAI | Unknown[f] (According to rumors: 1760)[59] | Unknown | Unknown, estimated 230,000 | Proprietary | |
| Cerebras-GPT | Mar 2023 | Cerebras | 13B[60] | | 270[27] | Apache 2.0 | |
| Falcon | Mar 2023 | Technology Innovation Institute | 40B[61] | 1T tokens, from RefinedWeb (filtered web text corpus)[62] plus some "curated corpora".[63] | 2800[57] | Apache 2.0[64] | |
| BloombergGPT | Mar 2023 | Bloomberg L.P. | 50B | 363B tokens from Bloomberg's proprietary data sources, plus 345B tokens from general purpose datasets[65] | Unknown | Unreleased | Designed for financial tasks.[65] |
| PanGu-Σ | Mar 2023 | Huawei | 1085B | 329B tokens[66] | Unknown | Proprietary | |
| OpenAssistant[67] | Mar 2023 | LAION | 17B | 1.5T tokens | Unknown | Apache 2.0 | |
| Jurassic-2[68][69] | Mar 2023 | AI21 Labs | Unknown | Unknown | Unknown | Proprietary | |
| PaLM 2 (Pathways Language Model 2) | May 2023 | Google | 340B[70] | 3.6T tokens[70] | 85,000[57] | Proprietary | |
| YandexGPT | May 17, 2023 | Yandex | Unknown | Unknown | Unknown | Proprietary | |
| Phi-1 | Jun 21, 2023 | Microsoft | 1.3B[72] | 7B tokens[72] | Unknown | MIT | Trained for 4 days on 8 A100s.[72] |
| Llama 2 | Jul 2023 | Meta AI | 70B[73] | 2T tokens[73] | 21,000 | Llama 2 | Trained over 3.3 million GPU (A100) hours.[74] |
| Claude 2 | Jul 2023 | Anthropic | Unknown | Unknown | Unknown | Proprietary | Used in the Claude chatbot.[75] |
| Granite 13b | Jul 2023 | IBM | Unknown | Unknown | Unknown | Proprietary | Used in IBM Watsonx.[76] |
| Mistral 7B | Sep 2023 | Mistral AI | 7.3B[77] | Unknown | Unknown | Apache 2.0 | |
| YandexGPT 2 | Sep 7, 2023 | Yandex | Unknown | Unknown | Unknown | Proprietary | |
| Claude 2.1 | Nov 2023 | Anthropic | Unknown | Unknown | Unknown | Proprietary | Used in the Claude chatbot. Has a context window of 200,000 tokens, or ~500 pages.[78] |
| Grok-1[79] | Nov 2023 | xAI | 314B | Unknown | Unknown | Apache 2.0 | |
| Gemini 1.0 | Dec 2023 | Google DeepMind | Unknown | Unknown | Unknown | Proprietary | Multimodal model, comes in three sizes. Used in the chatbot of the same name.[81] |
| Mixtral 8x7B | Dec 2023 | Mistral AI | 46.7B | Unknown | Unknown | Apache 2.0 | Outperforms GPT-3.5 and Llama 2 70B on many benchmarks.[82] Mixture of experts model, with 12.9 billion parameters activated per token.[83] |
| DeepSeek-LLM | Nov 29, 2023 | DeepSeek | 67B | 2T tokens[84]: table 2 | 12,000 | DeepSeek | Trained on English and Chinese text. Used 10^24 training FLOPs for the 67B model and 10^23 FLOPs for the 7B model.[84]: figure 5 |
| Phi-2 | Dec 2023 | Microsoft | 2.7B | 1.4T tokens | 419[85] | MIT | Trained on real and synthetic "textbook-quality" data over 14 days on 96 A100 GPUs.[85] |
2024
| Name | Release date[b] | Developer | Number of parameters | Corpus size | Training cost | License[c] | Notes |
|---|---|---|---|---|---|---|---|
| Gemini 1.5 | Feb 2024 | Google DeepMind | Unknown | Unknown | Unknown | Proprietary | Multimodal model based on a MoE architecture. Context window above 1 million tokens.[86] |
| Gemini Ultra | Feb 2024 | Google DeepMind | Unknown | Unknown | Unknown | Proprietary | |
| Gemma | Feb 2024 | Google DeepMind | 7B | 6T tokens | Unknown | Gemma Terms of Use[87] | |
| OLMo | Feb 2024 | Allen Institute for AI | 7B[88] | 2T tokens[89] | Unknown | Apache 2.0 | |
| Claude 3 | Mar 2024 | Anthropic | Unknown | Unknown | Unknown | Proprietary | Includes three models: Haiku, Sonnet, and Opus.[90] |
| DBRX | Mar 2024 | Databricks and Mosaic ML | 136B | 12T tokens | Unknown | Databricks Open Model[91][92] | |
| YandexGPT 3 Pro | Mar 28, 2024 | Yandex | Unknown | Unknown | Unknown | Proprietary | |
| Fugaku-LLM[93] | May 2024 | Fujitsu, Tokyo Institute of Technology, Tohoku University, RIKEN, etc. | 13B | 380B tokens | Unknown | Fugaku-LLM Terms of Use[94] | |
| Chameleon | May 2024 | Meta AI | 34B[96] | 4.4T | Unknown | Non-commercial research[97] | |
| Mixtral 8x22B[98] | Apr 17, 2024 | Mistral AI | 141B | Unknown | Unknown | Apache 2.0 | |
| Phi-3 | Apr 23, 2024 | Microsoft | 14B[99] | 4.8T tokens[100] | Unknown | MIT | Marketed by Microsoft as a "small language model".[99] |
| Granite Code Models | May 2024 | IBM | Unknown | Unknown | Unknown | Apache 2.0 | |
| YandexGPT 3 Lite | May 28, 2024 | Yandex | Unknown | Unknown | Unknown | Proprietary | |
| Qwen2 | Jun 2024 | Alibaba Cloud | 72B[101] | 3T tokens | Unknown | Various | |
| DeepSeek-V2 | Jun 2024 | DeepSeek | 236B | 8.1T tokens | 28,000 | DeepSeek | Trained for 1.4M GPU-hours on H800.[102] |
| Nemotron-4 | Jun 2024 | Nvidia | 340B | 9T tokens | 200,000 | NVIDIA Open Model[103][104] | |
| Claude 3.5 | Jun 2024 | Anthropic | Unknown | Unknown | Unknown | Proprietary | |
| Llama 3.1 | Jul 2024 | Meta AI | 405B | 15.6T tokens | 440,000 | Llama 3 | |
| Grok-2 | Aug 14, 2024 | xAI | Unknown | Unknown | Unknown | xAI Community License Agreement[111][112] | |
| OpenAI o1 | Sep 12, 2024 | OpenAI | Unknown | Unknown | Unknown | Proprietary | |
| Sarvam-1 | Oct 24, 2024 | Sarvam AI | 2B | ~2T tokens | Unknown | Sarvam AI Research | |
| YandexGPT 4 Lite and Pro | Oct 24, 2024 | Yandex | Unknown | Unknown | Unknown | Proprietary | |
| Mistral Large | Nov 2024 | Mistral AI | 123B | Unknown | Unknown | Mistral Research | Upgraded over time. The latest version is 24.11.[119] |
| Pixtral | Nov 2024 | Mistral AI | 123B | Unknown | Unknown | Mistral Research | Multimodal. There is also a 12B version which is under Apache 2 license.[119] |
| OLMo 2 | Nov 2024 | Allen Institute for AI | 32B[120][121] | 6.6T tokens[121] | 15,000[121] | Apache 2.0 | |
| Phi-4 | Dec 12, 2024 | Microsoft | 14B[122] | 9.8T tokens | Unknown | MIT | Marketed by Microsoft as a "small language model".[123] |
| DeepSeek-V3 | Dec 2024 | DeepSeek | 671B | 14.8T tokens | 56,000 | MIT | |
| Amazon Nova | Dec 2024 | Amazon | Unknown | Unknown | Unknown | Proprietary | Includes three models: Nova Micro, Nova Lite, and Nova Pro.[126] |
2025
| Name | Release date[b] | Developer | Number of parameters | Corpus size | License[c] | Notes |
|---|---|---|---|---|---|---|
| DeepSeek-R1 | Jan 20 | DeepSeek | 671B | Not applicable | MIT | |
| Qwen2.5 | Jan 26 | Alibaba | 72B | 18T tokens | Various | 7 dense models with parameter counts from 0.5B to 72B. Alibaba also released 2 MoE variants.[129] |
| MiniMax-Text-01 | Jan 14 | Minimax | 456B | 4.7T tokens[130] | Minimax Model | |
| Gemini 2.0 | Feb 5 | Google DeepMind | Unknown | Unknown | Proprietary | |
| Grok 3 | Feb 19 | xAI | Unknown | Unknown | Proprietary | Training cost claimed to be "10x the compute of previous state-of-the-art models".[135] |
| Claude 3.7 | Feb 24 | Anthropic | Unknown | Unknown | Proprietary | One model, Sonnet 3.7.[136] |
| YandexGPT 5 Lite Pretrain and Pro | Feb 25 | Yandex | Unknown | Unknown | Proprietary | |
| GPT-4.5 | Feb 27 | OpenAI | Unknown | Unknown | Proprietary | OpenAI's largest non-reasoning model at the time.[137] |
| Gemini 2.5 | Mar 25 | Google DeepMind | Unknown | Unknown | Proprietary | Three models released: Flash, Flash-Lite and Pro.[138] |
| YandexGPT 5 Lite Instruct | Mar 31 | Yandex | Unknown | Unknown | Proprietary | |
| Llama 4 | Apr 5 | Meta AI | 400B | 40T tokens | Llama 4 | |
| OpenAI o3 and o4-mini | Apr 16 | OpenAI | Unknown | Unknown | Proprietary | Reasoning models.[141] |
| Qwen3 | Apr 28 | Alibaba Cloud | 235B | 36T tokens | Apache 2.0 | Multiple sizes, the smallest being 0.6B.[142] |
| Claude 4 | May 22 | Anthropic | Unknown | Unknown | Proprietary | Includes two models, Sonnet and Opus.[143] |
| Sarvam-M | May 23 | Sarvam AI | 24B | Unknown | Apache 2.0 | |
| Grok 4 | Jul 9 | xAI | Unknown | Unknown | Proprietary | |
| Param-1 | Jul 21 | BharatGen | 2.9B[147] | 5T tokens[g][147] | Apache 2.0 | |
| GLM-4.5 | Jul 29 | Z.ai | 355B | 22T tokens[149][h] | MIT | Released in 355B and 106B sizes.[150] |
| GPT-OSS | Aug 5 | OpenAI | 117B | Unknown | Apache 2.0 | Released in 20B and 120B sizes.[151] |
| Claude 4.1 | Aug 5 | Anthropic | Unknown | Unknown | Proprietary | Includes one model, Opus.[152] |
| GPT-5 | Aug 7 | OpenAI | Unknown | Unknown | Proprietary | |
| DeepSeek-V3.1 | Aug 21 | DeepSeek | 671B | 15.639T | MIT | |
| YandexGPT 5.1 Pro | Aug 28 | Yandex | Unknown | Unknown | Proprietary | |
| Apertus | Sep 2 | ETH Zurich and EPF Lausanne | 70B | 15T[157] | Apache 2.0 | |
| Claude Sonnet 4.5 | Sep 29 | Anthropic | Unknown | Unknown | Proprietary | |
| GLM-4.6 | Sep 30 | Z.ai | 357B | Unknown | Apache 2.0 | |
| Alice AI LLM 1.0 | Oct 28 | Yandex | Unknown | Unknown | Proprietary | |
| Gemini 3 | Nov 18 | Google DeepMind | Unknown | Unknown | Proprietary | Models released: Deep Think and Pro.[163] |
| Olmo 3[164] | Nov 20 | Allen Institute for AI | 32B | 5.9T tokens[165] | Apache 2.0 | Includes 7B and 32B parameter versions, alongside reasoning and instruction-following models.[165] |
| Claude Opus 4.5 | Nov 24 | Anthropic | Unknown | Unknown | Proprietary | Largest model in the Claude family.[166] |
| DeepSeek-V3.2 | Dec 1 | DeepSeek | 685B | Unknown | MIT | |
| GPT-5.2 | Dec 11 | OpenAI | Unknown | Unknown | Proprietary | It was able to solve an open problem in statistical learning theory that had previously remained unresolved by human researchers.[170] |
| GLM-4.7 | Dec 22 | Z.ai | 355B | Unknown | Apache 2.0 | |
2026
| Name | Release date[b] | Developer | Number of parameters | Corpus size | License[c] | Notes |
|---|---|---|---|---|---|---|
| Qwen3-Max-Thinking | Jan 26 | Alibaba Cloud | Unknown | Unknown | Proprietary | Proprietary reasoning model with adaptive tool-use, test-time scaling, and iterative self-reflection.[171] |
| Kimi K2.5 | Jan 27 | Moonshot AI | 1040B | 15T tokens | Modified MIT | |
| Step-3.5-Flash | Feb 12 | StepFun | 196B | Unknown | Apache 2.0 | |
| Claude Opus 4.6 | Feb 5 | Anthropic | Unknown | Unknown | Proprietary | |
| GPT-5.3-Codex | Feb 5 | OpenAI | Unknown | Unknown | Proprietary | |
| GLM-5 | Feb 12 | Z.ai | 754B | Unknown | MIT | |
| Claude Sonnet 4.6 | Feb 17 | Anthropic | Unknown | Unknown | Proprietary | |
| Param-2 | Feb 17 | BharatGen | 17B | ~22T tokens | BharatGen Research[178] | Mixture-of-experts model, successor of Param-1; many more Indic languages are supported. Trained on H100 GPUs for 24 days.[179] |
| Sarvam-105B | Feb 18[i] | Sarvam AI | 105B[181] | 12T tokens[181] | Apache 2.0 | |
| Sarvam-30B | Feb 18[i] | Sarvam AI | 30B[181] | 16T tokens[181] | Apache 2.0 | |
| GPT-5.4 | Mar 5 | OpenAI | Unknown | Unknown | Proprietary | |
| Mistral Small 4 | Mar 17 | Mistral AI | 119B | Unknown | Apache 2.0 | |
| MiMo-V2-Pro | Mar 18 | Xiaomi | 1000B[187] | Unknown | Proprietary | Mixture-of-experts (MoE) model with more than 1 trillion parameters (43 billion active). Designed for agentic scenarios. Initially available on OpenRouter under the codename "Hunter Alpha" before official release.[188] |
| Gemma 4 | Apr 2 | Google DeepMind | 31B | Unknown | Apache 2.0 | |
| GLM-5.1 | Apr 7 | Z.ai | 754B | Unknown | MIT | |
| Muse Spark | Apr 8 | Meta Superintelligence Labs | Unknown | Unknown | Proprietary | |
| Qwen3.6 (Qwen3.6-35B-A3B) | Apr 15 | Alibaba Cloud | 35B | Unknown | Apache 2.0 | |
| Claude Opus 4.7 | Apr 16 | Anthropic | Unknown | Unknown | Proprietary | |
| GPT-5.5 | Apr 23 | OpenAI | Unknown | Unknown | Proprietary | |
| DeepSeek-V4-Flash | Apr 24 | DeepSeek | 284B | 32T | MIT | Preview release[196] |
| DeepSeek-V4-Pro | Apr 24 | DeepSeek | 1.6T | 32T | MIT | |
| MiMo-V2.5-Pro | Apr 27 | Xiaomi | 1.02T | 48T | MIT | |
| MiMo-V2.5 | Apr 27 | Xiaomi | 310B | 27T | MIT | Omni-modal MoE model with agentic capabilities and 1M-token context.[199] |
| Gemini 3.5 Flash | May 19 | Google DeepMind | Unknown | Unknown | Proprietary | |
| Claude Opus 4.8 | May 28 | Anthropic | Unknown | Unknown | Proprietary | |
| Step 3.7 Flash | May 29 | StepFun | 198B[j] | Unknown | Apache 2.0 | |
Notes
- [a] In many cases, researchers release or report on multiple versions of a model having different sizes. In these cases, the size of the largest model is listed here.
- [b] This is the date that documentation describing the model's architecture was first released.
- [c] This is the license of the pre-trained model weights. In almost all cases the training code itself is open-source or can be easily replicated. LLMs may be licensed differently from the chatbots that use them; for the licenses of chatbots, see List of chatbots.
- [d] The smaller models including 66B are publicly available, while the 175B model is available on request.
- [e] Facebook's license and distribution scheme restricted access to approved researchers, but the model weights were leaked and became widely available.
- [g] "focus[ed] on India’s linguistic landscape"
- [h] Corpus size was calculated by combining the 15 trillion tokens and the 7 trillion tokens pre-training mix.
- [j] 196B + 1.8B (ViT)