List of large language models
A large language model (LLM) is a type of machine learning model designed for natural language processing tasks such as language generation. LLMs are language models with many parameters, and are trained with self-supervised learning on a vast amount of text.
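As an illustration of the self-supervised setup mentioned above, the following sketch (not from the article; the toy corpus and names are invented for illustration) shows how raw text supplies its own training targets via next-token prediction.

```python
# Illustrative sketch only: self-supervised pretraining of an LLM reduces to
# next-token prediction, so raw text supplies its own labels.
corpus = "the cat sat on the mat".split()   # toy stand-in for a web-scale corpus

# Every prefix of the text is an input; the token that follows is its target.
examples = [(corpus[:i], corpus[i]) for i in range(1, len(corpus))]

for context, target in examples:
    print(f"context={' '.join(context)!r} -> predict {target!r}")
```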
List
For the training cost column, 1 petaFLOP-day equals 1 petaFLOP/sec × 1 day, or 8.64×10^19 FLOP (floating-point operations). Only the cost of the largest model is shown.
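A minimal sketch of this unit conversion, using the 3,640 petaFLOP-days from the GPT-3 row below as a worked example (the variable names are illustrative):

```python
# Minimal sketch of the training-cost unit used in the table.
PFLOP_DAY_IN_FLOP = 1e15 * 86_400        # 1 petaFLOP/s sustained for one day = 8.64e19 FLOP

gpt3_cost_pflop_days = 3_640             # "Training cost" cell of the GPT-3 row
print(f"{gpt3_cost_pflop_days * PFLOP_DAY_IN_FLOP:.2e} FLOP")   # ≈ 3.14e+23 FLOP
```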
| Name | Release date[a] | Developer | Number of parameters (billion)[b] | Corpus size | Training cost (petaFLOP-day) | License[c] | Notes |
|---|---|---|---|---|---|---|---|
| GPT-1 | June 11, 2018 | OpenAI | 0.117 | Unknown | 1[1] | MIT[2] | First GPT model, decoder-only transformer. Trained for 30 days on 8 P600 GPUs.[3] |
| BERT | October 2018 | Google | 0.340[4] | 3.3 billion words[4] | 9[5] | Apache 2.0[6] | An early and influential language model.[7] Encoder-only and thus not built to be prompted or generative.[8] Training took 4 days on 64 TPUv2 chips.[9] |
| T5 | October 2019 | Google | 11[10] | 34 billion tokens[10] | | Apache 2.0[11] | Base model for many Google projects, such as Imagen.[12] |
| XLNet | June 2019 | Google | 0.340[13] | 33 billion words | 330 | Apache 2.0[14] | An alternative to BERT; designed as encoder-only. Trained on 512 TPU v3 chips for 5.5 days.[15] |
| GPT-2 | February 2019 | OpenAI | 1.5[16] | 40GB[17] (~10 billion tokens)[18] | 28[19] | MIT[20] | Trained on 32 TPUv3 chips for 1 week.[19] |
| GPT-3 | May 2020 | OpenAI | 175[21] | 300 billion tokens[18] | 3640[22] | Proprietary | A fine-tuned variant of GPT-3, termed GPT-3.5, was made available to the public through a web interface called ChatGPT in 2022.[23] |
| GPT-Neo | March 2021 | EleutherAI | 2.7[24] | 825 GiB[25] | Unknown | MIT[26] | The first of a series of free GPT-3 alternatives released by EleutherAI. GPT-Neo outperformed an equivalent-size GPT-3 model on some benchmarks, but was significantly worse than the largest GPT-3.[26] |
| GPT-J | June 2021 | EleutherAI | 6[27] | 825 GiB[25] | 200[28] | Apache 2.0 | GPT-3-style language model trained on The Pile. |
| Megatron-Turing NLG | October 2021[29] | Microsoft and Nvidia | 530[30] | 338.6 billion tokens[30] | 38000[31] | Unreleased | Trained for 3 months on over 2000 A100 GPUs on the NVIDIA Selene Supercomputer, for over 3 million GPU-hours.[31] |
| Ernie 3.0 Titan | December 2021 | Baidu | 260[32] | 4TB | Unknown | Proprietary | Chinese-language LLM initially used by Ernie Bot. |
| Claude[33] | December 2021 | Anthropic | 52[34] | 400 billion tokens[34] | Unknown | Proprietary | Fine-tuned for desirable behavior in conversations.[35] |
| GLaM (Generalist Language Model) | December 2021 | Google | 1200[36] | 1.6 trillion tokens[36] | 5600[36] | Proprietary | Sparse mixture-of-experts (MoE) model, making it more expensive to train but cheaper to run inference compared to GPT-3. |
| Gopher | December 2021 | DeepMind | 280[37] | 300 billion tokens[38] | 5833[39] | Proprietary | Later developed into the Chinchilla model. |
| LaMDA (Language Models for Dialog Applications) | January 2022 | Google | 137[40] | 1.56T words,[40] 168 billion tokens[38] | 4110[41] | Proprietary | Specialized for response generation in conversations. |
| GPT-NeoX | February 2022 | EleutherAI | 20[42] | 825 GiB[25] | 740[28] | Apache 2.0 | Based on the Megatron architecture. |
| Chinchilla | March 2022 | DeepMind | 70[43] | 1.4 trillion tokens[43][38] | 6805[39] | Proprietary | Reduced-parameter model trained on more data. Used in the Sparrow bot. Often cited for its neural scaling law. |
| PaLM (Pathways Language Model) | April 2022 | Google | 540[44] | 768 billion tokens[43] | 29,250[39] | Proprietary | Trained for ~60 days on ~6000 TPU v4 chips.[39] |
| OPT (Open Pretrained Transformer) | May 2022 | Meta | 175[45] | 180 billion tokens[46] | 310[28] | Non-commercial research[d] | GPT-3 architecture with some adaptations from Megatron. The training logbook written by the team was published.[47] |
| YaLM 100B | June 2022 | Yandex | 100[48] | 1.7TB[48] | Unknown | Apache 2.0 | English-Russian model based on Microsoft's Megatron-LM. |
| Minerva | June 2022 | Google | 540[49] | 38.5B tokens from webpages filtered for mathematical content and from papers submitted to the arXiv preprint server[49] | Unknown | Proprietary | For solving "mathematical and scientific questions using step-by-step reasoning".[50] Initialized from PaLM models, then finetuned on mathematical and scientific data. |
| BLOOM | July 2022 | Large collaboration led by Hugging Face | 175[51] | 350 billion tokens (1.6TB)[52] | Unknown | Responsible AI | Essentially GPT-3 but trained on a multi-lingual corpus (30% English excluding programming languages). |
| Galactica | November 2022 | Meta | 120 | 106 billion tokens[53] | Unknown | CC-BY-NC-4.0 | Trained on scientific text and modalities. |
| AlexaTM (Teacher Models) | November 2022 | Amazon | 20[54] | 1.3 trillion[55] | Unknown | Proprietary[56] | Uses a bidirectional sequence-to-sequence architecture. |
| Llama | February 2023 | Meta AI | 65[57] | 1.4 trillion[57] | 6300[58] | Non-commercial research[e] | Corpus has 20 languages. "Overtrained" (compared to the Chinchilla scaling law) for better performance with fewer parameters.[57][failed verification] |
| GPT-4 | March 2023 | OpenAI | Unknown[f] (According to rumors: 1760)[60] | Unknown | Unknown, estimated 230,000 | Proprietary | Now available for all ChatGPT users; used in several products. |
| Cerebras-GPT | March 2023 | Cerebras | 13[61] | | 270[28] | Apache 2.0 | Trained with the Chinchilla neural scaling formula. |
| Falcon | March 2023 | Technology Innovation Institute | 40[62] | 1 trillion tokens, from RefinedWeb (filtered web text corpus)[63] plus some "curated corpora".[64] | 2800[58] | Apache 2.0[65] | |
| BloombergGPT | March 2023 | Bloomberg L.P. | 50 | 363 billion tokens from Bloomberg's proprietary data sources, plus 345 billion tokens from general purpose datasets[66] | Unknown | Unreleased | Designed for financial tasks.[66] |
| PanGu-Σ | March 2023 | Huawei | 1085 | 329 billion tokens[67] | Unknown | Proprietary | |
| OpenAssistant[68] | March 2023 | LAION | 17 | 1.5 trillion tokens | Unknown | Apache 2.0 | Trained on crowdsourced, open conversational data. |
| Jurassic-2[69][70] | March 2023 | AI21 Labs | Unknown | Unknown | Unknown | Proprietary | |
| PaLM 2 (Pathways Language Model 2) | May 2023 | Google | 340[71] | 3.6 trillion tokens[71] | 85,000[58] | Proprietary | Used in the Bard chatbot.[72] |
| YandexGPT | May 17, 2023 | Yandex | Unknown | Unknown | Unknown | Proprietary | Used in the Alice chatbot. |
| Llama 2 | July 2023 | Meta AI | 70[73] | 2 trillion tokens[73] | 21,000 | Llama 2 license | Trained for over 3.3 million GPU-hours on A100 GPUs.[74] |
| Claude 2 | July 2023 | Anthropic | Unknown | Unknown | Unknown | Proprietary | Used in the Claude chatbot.[75] |
| Granite 13b | July 2023 | IBM | Unknown | Unknown | Unknown | Proprietary | Used in IBM Watsonx.[76] |
| Mistral 7B | September 2023 | Mistral AI | 7.3[77] | Unknown | Unknown | Apache 2.0 | |
| YandexGPT 2 | September 7, 2023 | Yandex | Unknown | Unknown | Unknown | Proprietary | Used in the Alice chatbot. |
| Claude 2.1 | November 2023 | Anthropic | Unknown | Unknown | Unknown | Proprietary | Used in the Claude chatbot. Has a context window of 200,000 tokens, or ~500 pages.[78] |
| Grok 1[79] | November 2023 | xAI | 314 | Unknown | Unknown | Apache 2.0 | Used in the Grok chatbot. Grok 1 has a context length of 8,192 tokens and has access to X (Twitter).[80] |
| Gemini 1.0 | December 2023 | Google DeepMind | Unknown | Unknown | Unknown | Proprietary | Multimodal model, comes in three sizes. Used in the chatbot of the same name.[81] |
| Mixtral 8x7B | December 2023 | Mistral AI | 46.7 | Unknown | Unknown | Apache 2.0 | Outperforms GPT-3.5 and Llama 2 70B on many benchmarks.[82] Mixture-of-experts model, with 12.9 billion parameters activated per token.[83] |
| DeepSeek-LLM | November 29, 2023 | DeepSeek | 67 | 2T tokens[84]: table 2 | 12,000 | DeepSeek License | Trained on English and Chinese text. Used roughly 10^24 training FLOPs for the 67B model and roughly 10^23 FLOPs for the 7B model.[84]: figure 5 |
| Phi-2 | December 2023 | Microsoft | 2.7 | 1.4T tokens | 419[85] | MIT | Trained on real and synthetic "textbook-quality" data over 14 days on 96 A100 GPUs.[85] |
| Gemini 1.5 | February 2024 | Google DeepMind | Unknown | Unknown | Unknown | Proprietary | Multimodal model based on a MoE architecture. Context window above 1 million tokens.[86] |
| Gemini Ultra | February 2024 | Google DeepMind | Unknown | Unknown | Unknown | Proprietary | |
| Gemma | February 2024 | Google DeepMind | 7 | 6T tokens | Unknown | Gemma Terms of Use[87] | |
| OLMo | February 2024 | Allen Institute for AI | 7[88] | 2T tokens[89] | Unknown | Apache 2.0 | |
| Claude 3 | March 2024 | Anthropic | Unknown | Unknown | Unknown | Proprietary | Includes three models: Haiku, Sonnet, and Opus.[90] |
| DBRX | March 2024 | Databricks and MosaicML | 136 | 12T tokens | Unknown | Databricks Open Model License[91][92] | Training cost 10 million USD.[citation needed] |
| YandexGPT 3 Pro | March 28, 2024 | Yandex | Unknown | Unknown | Unknown | Proprietary | Used in the Alice chatbot. |
| Fugaku-LLM | May 2024 | Fujitsu, Tokyo Institute of Technology, and others | 13 | 380B tokens | Unknown | Fugaku-LLM Terms of Use[93] | The largest model ever trained using only CPUs, on the Fugaku supercomputer.[94] |
| Chameleon | May 2024 | Meta AI | 34[95] | 4.4 trillion | Unknown | Non-commercial research[96] | |
| Mixtral 8x22B[97] | April 17, 2024 | Mistral AI | 141 | Unknown | Unknown | Apache 2.0 | |
| Phi-3 | April 23, 2024 | Microsoft | 14[98] | 4.8T tokens[citation needed] | Unknown | MIT | Marketed by Microsoft as a "small language model".[99] |
| Granite Code Models | May 2024 | IBM | Unknown | Unknown | Unknown | Apache 2.0 | |
| YandexGPT 3 Lite | May 28, 2024 | Yandex | Unknown | Unknown | Unknown | Proprietary | Used in the Alice chatbot. |
| Qwen2 | June 2024 | Alibaba Cloud | 72[100] | 3T tokens | Unknown | Qwen License | Multiple sizes, the smallest being 0.5B. |
| DeepSeek-V2 | June 2024 | DeepSeek | 236 | 8.1T tokens | 28,000 | DeepSeek License | Trained for 1.4 million GPU-hours on H800 GPUs.[101] |
| Nemotron-4 | June 2024 | Nvidia | 340 | 9T tokens | 200,000 | NVIDIA Open Model License[102][103] | Trained for 1 epoch. Trained on 6144 H100 GPUs between December 2023 and May 2024.[104][105] |
| Claude 3.5 | June 2024 | Anthropic | Unknown | Unknown | Unknown | Proprietary | Initially, only one model, Sonnet, was released.[106] In October 2024, Sonnet 3.5 was upgraded, and Haiku 3.5 became available.[107] |
| Llama 3.1 | July 2024 | Meta AI | 405 | 15.6T tokens | 440,000 | Llama 3 license | The 405B version took 31 million GPU-hours on H100-80GB, about 3.8×10^25 FLOP (see the worked example below the table).[108][109] |
| Grok-2 | August 14, 2024 | xAI | Unknown | Unknown | Unknown | xAI Community License Agreement[110][111] | Originally closed-source, then re-released as "Grok 2.5" under a source-available license in August 2025.[112][113] |
| OpenAI o1 | September 12, 2024 | OpenAI | Unknown | Unknown | Unknown | Proprietary | First LLM described as a "reasoning model".[114][115][better source needed] |
| YandexGPT 4 Lite and Pro | October 24, 2024 | Yandex | Unknown | Unknown | Unknown | Proprietary | Used in the Alice chatbot. |
| Mistral Large | November 2024 | Mistral AI | 123 | Unknown | Unknown | Mistral Research License | Upgraded over time. The latest version is 24.11.[116] |
| Pixtral | November 2024 | Mistral AI | 123 | Unknown | Unknown | Mistral Research License | Multimodal. There is also a 12B version under the Apache 2.0 license.[116] |
| OLMo 2 | November 2024 | Allen Institute for AI | 32[117][118] | 6.6T tokens[118] | 15,000[118] | Apache 2.0 | Initially had 7B and 13B parameter variants, with 32B released later. |
| Phi-4 | December 12, 2024 | Microsoft | 14[119] | 9.8T tokens | Unknown | MIT | Marketed by Microsoft as a "small language model".[120] |
| DeepSeek-V3 | December 2024 | DeepSeek | 671 | 14.8T tokens | 56,000 | MIT | Trained for 2.788 million GPU-hours on H800 GPUs.[121] Originally released under the DeepSeek License, then re-released under the MIT License as "DeepSeek-V3-0324" in March 2025.[122] |
| Amazon Nova | December 2024 | Amazon | Unknown | Unknown | Unknown | Proprietary | Includes three models: Nova Micro, Nova Lite, and Nova Pro.[123] |
| DeepSeek-R1 | January 2025 | DeepSeek | 671 | Not applicable | Unknown | MIT | No separate pretraining; trained with reinforcement learning on top of V3-Base.[124][125] |
| Qwen2.5 | January 2025 | Alibaba Cloud | 72 | 18T tokens | Unknown | Qwen License | 7 dense models with parameter counts from 0.5B to 72B. Alibaba also released 2 MoE variants.[126] |
| MiniMax-Text-01 | January 2025 | MiniMax | 456 | 4.7T tokens[127] | Unknown | Minimax Model license | [128][127] |
| Gemini 2.0 | February 2025 | Google DeepMind | Unknown | Unknown | Unknown | Proprietary | Three models released: Flash, Flash-Lite and Pro.[129][130][131] |
| Claude 3.7 | February 24, 2025 | Anthropic | Unknown | Unknown | Unknown | Proprietary | One model, Sonnet 3.7.[132] |
| YandexGPT 5 Lite Pretrain and Pro | February 25, 2025 | Yandex | Unknown | Unknown | Unknown | Proprietary | Used in the Alice Neural Network chatbot. |
| GPT-4.5 | February 27, 2025 | OpenAI | Unknown | Unknown | Unknown | Proprietary | OpenAI's largest non-reasoning model at the time.[133] |
| Grok 3 | February 2025 | xAI | Unknown | Unknown | Unknown | Proprietary | Training cost claimed to be "10x the compute of previous state-of-the-art models".[134] |
| Gemini 2.5 | March 25, 2025 | Google DeepMind | Unknown | Unknown | Unknown | Proprietary | Three models released: Flash, Flash-Lite and Pro.[135] |
| YandexGPT 5 Lite Instruct | March 31, 2025 | Yandex | Unknown | Unknown | Unknown | Proprietary | Used in the Alice Neural Network chatbot. |
| Llama 4 | April 5, 2025 | Meta AI | 400 | 40T tokens | Unknown | Llama 4 license | [136][137] |
| OpenAI o3 and o4-mini | April 16, 2025 | OpenAI | Unknown | Unknown | Unknown | Proprietary | Reasoning models.[138] |
| Qwen3 | April 2025 | Alibaba Cloud | 235 | 36T tokens | Unknown | Apache 2.0 | Multiple sizes, the smallest being 0.6B.[139] |
| Claude 4 | May 22, 2025 | Anthropic | Unknown | Unknown | Unknown | Proprietary | Includes two models, Sonnet and Opus.[140] |
| Sarvam-M | May 23, 2025 | Sarvam AI | 24 | Unknown | Unknown | Apache 2.0 | Hybrid reasoning model fine-tuned on a Mistral Small base; optimized for math, programming, and Indian languages.[141][142] |
| Grok 4 | July 9, 2025 | xAI | Unknown | Unknown | Unknown | Proprietary | [citation needed] |
| Param-1 | July 21, 2025 | BharatGen | 2.9[143] | 5T tokens "focus[ed] on India’s linguistic landscape"[143] | Unknown | Unknown | |
| GLM-4.5 | July 29, 2025 | Zhipu AI | 355 | 22T tokens[144][g] | Unknown | MIT | Released in 355B and 106B sizes.[145] |
| GPT-OSS | August 5, 2025 | OpenAI | 117 | Unknown | Unknown | Apache 2.0 | Released in 20B and 120B sizes.[146] |
| Claude 4.1 | August 5, 2025 | Anthropic | Unknown | Unknown | Unknown | Proprietary | Includes one model, Opus.[147] |
| GPT-5 | August 7, 2025 | OpenAI | Unknown | Unknown | Unknown | Proprietary | Includes three models: GPT-5, GPT-5 mini, and GPT-5 nano. GPT-5 is available in ChatGPT and via the API. It includes reasoning abilities.[148][149] |
| DeepSeek-V3.1 | August 21, 2025 | DeepSeek | 671 | 15.639T | | MIT | Based on DeepSeek V3 (trained on 14.8T tokens); further trained on 839B tokens from the extension phases (630B + 209B).[150] A hybrid model that can switch between thinking and non-thinking modes.[151] |
| YandexGPT 5.1 Pro | August 28, 2025 | Yandex | Unknown | Unknown | Unknown | Proprietary | Used in the Alice Neural Network chatbot. |
| Apertus | September 2, 2025 | EPFL, ETH Zurich, and CSCS | 70 | 15 trillion[152] | Unknown | Apache 2.0 | The first LLM to be compliant with the Artificial Intelligence Act of the European Union.[153] |
| Claude Sonnet 4.5 | September 29, 2025 | Anthropic | Unknown | Unknown | Unknown | Proprietary | [154] |
| DeepSeek-V3.2-Exp | September 29, 2025 | DeepSeek | 685 | | | MIT | Experimental model built upon V3.1-Terminus; uses a custom DeepSeek Sparse Attention (DSA) mechanism.[155][156][157] |
| GLM-4.6 | September 30, 2025 | Zhipu AI | 357 | | | Apache 2.0 | [158][159][160] |
| Alice AI LLM 1.0 | October 28, 2025 | Yandex | Unknown | Unknown | Unknown | Proprietary | Available in the Alice AI chatbot. |
| Gemini 3 | November 18, 2025 | Google DeepMind | Unknown | Unknown | Unknown | Proprietary | Two models released: Deep Think and Pro.[161] |
| Olmo 3[162] | November 20, 2025 | Allen Institute for AI | 32 | 5.9T tokens[163] | Unknown | Apache 2.0 | Includes 7B and 32B parameter versions, alongside reasoning and instruction-following models.[163] |
| Claude Opus 4.5 | November 24, 2025 | Anthropic | Unknown | Unknown | Unknown | Proprietary | The largest model in the Claude family.[164] |
| GPT-5.2 | December 11, 2025 | OpenAI | Unknown | Unknown | Unknown | Proprietary | Solved an open problem in statistical learning theory that human researchers had not previously resolved.[165] |
| GLM-4.7 | December 22, 2025 | Zhipu AI | 355 | | | Apache 2.0 | MoE architecture. Open-source state of the art on coding benchmarks.[citation needed] A smaller Flash variant (30B-A3B) was released on January 19, 2026. |
| Qwen3-Max-Thinking | January 26, 2026 | Alibaba Cloud | Unknown | Unknown | Unknown | Proprietary | Proprietary reasoning model with adaptive tool use, test-time scaling, and iterative self-reflection.[166] |
| Kimi K2.5 | January 27, 2026 | Moonshot AI | 1040 | 15T tokens | | Modified MIT License | Multimodal MoE with 32B active parameters, derived from Kimi K2.[167] Can use "Agent Swarm" technology to coordinate up to 100 parallel sub-agents.[168][169] |
| Claude Opus 4.6 | February 5, 2026 | Anthropic | Unknown | Unknown | Unknown | Proprietary | |
| GPT-5.3-Codex | February 5, 2026 | OpenAI | Unknown | Unknown | Unknown | Proprietary | |
| GLM-5 | February 12, 2026 | Zhipu AI | 754 | | | MIT | Specialized for agentic engineering and long-horizon tasks. Integrates DeepSeek Sparse Attention (DSA) for a 200K context window. |
| Param-2 | February 17, 2026 | BharatGen | 17 | ~22T tokens | Unknown | | Mixture-of-experts model, successor to Param-1; supports many more Indic languages. Trained on H100 GPUs for 24 days.[170] |
| Sarvam-1[171] | February 18, 2026[h] | Sarvam AI | 105 | ~12T tokens | Unknown | | India's first independently trained foundation model; has 105B and 30B versions.[173] A mixture-of-experts model, using only 10.3B active parameters at a time.[174] Superior in Indic languages.[compared to?] |
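The FLOP counts cited in some rows can be roughly reproduced from the parameter and token counts with the common C ≈ 6·N·D rule of thumb (training compute ≈ 6 × parameters × training tokens). This is an approximation, not the method behind the cited figures; the sketch below applies it to the Llama 3.1 row as a worked example.

```python
# Rough cross-check of the Llama 3.1 row using the C ≈ 6·N·D approximation.
# The rule of thumb is an assumption; the parameter and token values come from the table.
PFLOP_DAY_IN_FLOP = 8.64e19          # FLOP per petaFLOP-day (see the note above the table)

params = 405e9                       # 405 billion parameters
tokens = 15.6e12                     # 15.6 trillion training tokens

flop = 6 * params * tokens           # ≈ 3.8e25 FLOP, matching the figure cited in the row
print(f"{flop:.1e} FLOP ≈ {flop / PFLOP_DAY_IN_FLOP:,.0f} petaFLOP-days")
# prints ≈ 438,750 petaFLOP-days, close to the 440,000 listed in the table
```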
Notes
- This is the date that documentation describing the model's architecture was first released.
- In many cases, researchers release or report on multiple versions of a model having different sizes. In these cases, the size of the largest model is listed here.
- This is the license of the pre-trained model weights. In almost all cases the training code itself is open-source or can be easily replicated. LLMs may be licensed differently from the chatbots that use them; for the licenses of chatbots, see List of chatbots.
- The smaller models including 66B are publicly available, while the 175B model is available on request.
- Facebook's license and distribution scheme restricted access to approved researchers, but the model weights were leaked and became widely available.
- Corpus size was calculated by combining the 15-trillion-token and 7-trillion-token pre-training mixes.