Blackwell (microarchitecture)

GPU microarchitecture designed by Nvidia

Blackwell is a graphics processing unit (GPU) microarchitecture developed by Nvidia as the successor to the Hopper and Ada Lovelace microarchitectures.

A GB200 die with Blackwell processors
Launched: Q4 2024
Designed by: Nvidia
Manufactured by: TSMC
Fabrication process: TSMC 4NP (Datacenter[1]); TSMC 4N (Consumer[2])
Codenames: GB100, GB20x
Product Series
Desktop
Professional/workstation
  • RTX PRO Blackwell series
Specifications
Memory support: GDDR7 (Consumer); HBM3E (Datacenter)
PCIe support: PCIe 5.0 (Consumer); PCIe 6.0 (Datacenter)
Supported Graphics APIs
DirectX: DirectX 12 Ultimate (Feature Level 12_2)
Direct3D: Direct3D 12
Shader Model: Shader Model 6.8
OpenGL: OpenGL 4.6
Vulkan: Vulkan 1.4
Supported Compute APIs
OpenCL: OpenCL 3.0 (64-bit only, 32-bit support removed)[3]
CUDA: Compute Capability 10.x (64-bit only, 32-bit support removed)[3]; Compute Capability 12.x (64-bit only)[3]
DirectCompute: Yes
Media Engine
Encoder supported: NVENC
History
Predecessor: Ada Lovelace (consumer); Hopper (datacenter)
Successor: Rubin

The architecture is named after statistician and mathematician David Blackwell. The name leaked in 2022, and the B40 and B100 accelerators were confirmed in October 2023 by an official Nvidia roadmap shown during an investor presentation.[4] Nvidia officially announced Blackwell at its GTC 2024 keynote on March 18, 2024.[5]

History

David Blackwell (1919–2010)

In March 2022, Nvidia announced the Hopper datacenter architecture for AI accelerators. Demand for Hopper products was high throughout 2023's AI hype.[6] The lead time from order to delivery of H100-based servers was between 36 and 52 weeks due to shortages and high demand.[7] Nvidia reportedly sold 500,000 Hopper-based H100 accelerators in Q3 2023 alone.[7] Nvidia's AI dominance with Hopper products led to the company increasing its market capitalization to over $2 trillion, behind only Microsoft and Apple.[8]

The Blackwell architecture is named after American mathematician David Blackwell, who was known for his contributions to game theory, probability theory, information theory, and statistics. These areas have influenced or are implemented in transformer-based generative AI model designs or their training algorithms. Blackwell was the first African American scholar to be inducted into the National Academy of Sciences.[9]

In Nvidia's October 2023 Investor Presentation, its datacenter roadmap was updated to include reference to its B100 and B40 accelerators and the Blackwell architecture.[10][11] Previously, the successor to Hopper was simply named on roadmaps as "Hopper-Next". Nvidia's updated roadmap emphasized the move from a two-year release cadence for datacenter products to yearly releases targeted for x86 and ARM systems.

At the GPU Technology Conference (GTC) on March 18, 2024, Nvidia officially announced the Blackwell architecture, focusing on its B100 and B200 datacenter accelerators and associated products, such as the eight-GPU HGX B200 board and the 72-GPU NVL72 rack-scale system.[12] Nvidia CEO Jensen Huang said that with Blackwell, "we created a processor for the generative AI era" and emphasized the overall Blackwell platform, which combines Blackwell accelerators with Nvidia's ARM-based Grace CPU.[13][14] Nvidia touted endorsements of Blackwell from the CEOs of Google, Meta, Microsoft, OpenAI and Oracle.[14] The keynote did not mention gaming.

It was reported in October 2024 that there was a design flaw in the Blackwell architecture that had been fixed in collaboration with TSMC.[15] According to Huang, the design flaw was "functional" and "caused the yield[s] to be low".[16] By November 2024, Morgan Stanley was reporting that "the entire 2025 production" of Blackwell silicon was "already sold out".[17]

During the company's CES 2025 keynote, Nvidia announced that the foundation models for Blackwell will include models from Black Forest Labs (Flux), Meta AI, Mistral AI, and Stability AI.[18]

Architecture

Blackwell is designed for datacenter compute applications as well as for gaming and workstation applications, with dedicated dies for each purpose.

Process node

Blackwell is fabricated by TSMC on the custom 4NP process node for datacenter products and on the custom 4N process node for consumer products. 4NP is an enhancement of the 4N node used for the Hopper and Ada Lovelace architectures. The Nvidia-specific 4NP process likely adds metal layers to the standard TSMC N4P technology.[19] The GB100 die contains 104 billion transistors, a 30% increase over the 80 billion transistors in the previous-generation Hopper GH100 die.[20] Because Blackwell cannot reap the benefits of a major process node advancement, it must achieve power efficiency and performance gains through underlying architectural changes.[21]

The GB100 die is at the reticle limit of semiconductor fabrication.[22] The reticle limit is the maximum area that a lithography machine can pattern on a wafer in a single exposure, which caps the size of a monolithic die. Previously, Nvidia had nearly hit TSMC's reticle limit with GH100's 814 mm² die. To avoid being constrained by die size, Nvidia's B100 accelerator uses two GB100 dies in a single package, connected by a 10 TB/s link that Nvidia calls the NV-High Bandwidth Interface (NV-HBI). NV-HBI is based on the NVLink 7 protocol. Nvidia CEO Jensen Huang said in an interview with CNBC that Nvidia had spent around $10 billion on research and development for Blackwell's NV-HBI die interconnect. Veteran semiconductor engineer Jim Keller, who had worked on AMD's K7, K12 and Zen architectures, criticized this figure and claimed that the same outcome could be achieved for $1 billion by using Ultra Ethernet rather than the proprietary NVLink system.[23] The two connected GB100 dies are able to act like one large monolithic piece of silicon, with full cache coherency between both dies.[24] The dual-die package totals 208 billion transistors.[22] The two GB100 dies are placed on top of a silicon interposer produced using TSMC's CoWoS-L 2.5D packaging technique.[25]

On the consumer side, Blackwell's largest die, GB202, measures 750 mm², which is 20% larger than AD102, Ada Lovelace's largest die.[26] GB202 contains a total of 24,576 CUDA cores, 33.3% more than the 18,432 CUDA cores in AD102. GB202 is the largest consumer die designed by Nvidia since the 754 mm² TU102 die in 2018, which was based on the Turing microarchitecture. The gap between GB202 and GB203 is also much wider than in previous generations: GB202 has more than double the CUDA cores of GB203, which was not the case for AD102 relative to AD103.
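The ratios quoted in this section can be recomputed from the stated counts. The following is a minimal Python sketch; the helper name `percent_increase` and the constant names are illustrative, and the GB203 core count is taken from the consumer die table further down.

```python
def percent_increase(new, old):
    """Return the percentage by which `new` exceeds `old`."""
    return (new - old) / old * 100.0

# Figures quoted in the article text and in the consumer die table.
GB100_TRANSISTORS = 104e9   # single GB100 die
GH100_TRANSISTORS = 80e9    # previous-generation Hopper die
GB202_CUDA_CORES = 24_576
AD102_CUDA_CORES = 18_432
GB203_CUDA_CORES = 10_752

# Generational transistor increase for the datacenter die.
transistor_gain = percent_increase(GB100_TRANSISTORS, GH100_TRANSISTORS)

# The dual-die package carries two GB100 dies.
dual_die_transistors = 2 * GB100_TRANSISTORS

# CUDA core increase of GB202 over AD102, and the GB202:GB203 ratio.
cuda_gain = percent_increase(GB202_CUDA_CORES, AD102_CUDA_CORES)
gb202_to_gb203 = GB202_CUDA_CORES / GB203_CUDA_CORES

if __name__ == "__main__":
    print(f"GB100 vs GH100 transistors: +{transistor_gain:.1f}%")
    print(f"Dual-die package: {dual_die_transistors / 1e9:.0f} billion transistors")
    print(f"GB202 vs AD102 CUDA cores: +{cuda_gain:.1f}%")
    print(f"GB202 / GB203 CUDA cores: {gb202_to_gb203:.2f}x")
```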

Streaming multiprocessor

CUDA cores

CUDA Compute Capability 10.0 and Compute Capability 12.0 are added with Blackwell.[27]

Tensor Cores

The Blackwell architecture introduces fifth-generation Tensor Cores for AI compute and performing floating-point calculations. In the data center, Blackwell adds native support for sub-8-bit data types, including new Open Compute Project (OCP) community-defined MXFP6 and MXFP4 microscaling formats to improve efficiency and accuracy in low-precision computations.[28][29][30][31][32] The previous Hopper architecture introduced the Transformer Engine, software to facilitate quantization of higher-precision models (e.g., FP32) to lower precision, for which Hopper has greater throughput. Blackwell's second-generation Transformer Engine adds support for MXFP4 and MXFP6. Using 4-bit data allows greater efficiency and throughput for model inference during generative AI training. Nvidia claims 20 petaflops (excluding the 2x gain the company claims for sparsity) of FP4 compute for the dual-GPU GB200 superchip.[33]
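To illustrate what a microscaling format does, the following Python sketch quantizes one block of values to an MXFP4-like representation. It assumes the block layout described in the OCP Microscaling specification: 32 elements share one power-of-two scale, and each element is an FP4 (E2M1) value. The function names are hypothetical, ties are rounded toward the smaller magnitude for simplicity, and the sketch does not represent Nvidia's Transformer Engine or the Tensor Core hardware path.

```python
import math

# Magnitudes representable by one FP4 E2M1 element (the sign is a separate bit).
FP4_E2M1_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
FP4_EMAX = 2      # exponent of the largest E2M1 binade (4.0 == 2**2)
BLOCK_SIZE = 32   # elements that share one scale (assumed from the OCP MX spec)

def quantize_block_mxfp4(block):
    """Quantize one block of floats; return (shared_exponent, elements)."""
    amax = max(abs(v) for v in block)
    if amax == 0.0:
        return 0, [0.0] * len(block)
    # Shared power-of-two scale chosen so the largest value lands in the top binade.
    shared_exp = math.floor(math.log2(amax)) - FP4_EMAX
    scale = 2.0 ** shared_exp
    elements = []
    for v in block:
        # Scale, then saturate at the largest representable magnitude.
        mag = min(abs(v) / scale, FP4_E2M1_MAGNITUDES[-1])
        # Pick the nearest representable magnitude (simplified tie handling).
        nearest = min(FP4_E2M1_MAGNITUDES, key=lambda m: abs(m - mag))
        elements.append(math.copysign(nearest, v))
    return shared_exp, elements

def dequantize_block(shared_exp, elements):
    """Reconstruct approximate floats from the shared exponent and elements."""
    scale = 2.0 ** shared_exp
    return [e * scale for e in elements]

if __name__ == "__main__":
    values = [0.0, 1.0, -3.0, 6.0, 12.0]
    exp, elems = quantize_block_mxfp4(values)
    print(exp, elems, dequantize_block(exp, elems))
```

Under these assumptions a full block stores 32 four-bit elements plus one 8-bit shared scale, or about 4.25 bits per value, which is where the storage and bandwidth savings over 8-bit and 16-bit formats come from.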

Ray Tracing Cores

Blackwell introduces fourth-generation ray tracing cores, which include a new Triangle Cluster Intersection Engine for Mega Geometry and Linear Swept Spheres for accelerated ray tracing of finer details, such as hair.[2]

AI Management Processor

Blackwell introduces an AI Management Processor (AMP), a dedicated scheduler chip on the GPU built on RISC-V.[2] It is designed to offload scheduling from the CPU to a greater degree than in previous generations and helps the GPU better control its own resources. It is used through Windows Hardware-Accelerated GPU Scheduling (HAGS).

Blackwell dies

Datacenter

Die GB100 GB102 GB200
Variant(s) Unknown Unknown Unknown
Release date Dec 2024 Nov 2024 Unknown
Cores CUDA Cores 18,432
TMUs 576
ROPs 24
RT Cores Unknown Unknown Unknown
Tensor Cores 576
Streaming Multiprocessors Unknown Unknown Unknown
Cache L1 8.25 MB
L2 60 MB
Memory interface 8192-bit
Die size Unknown Unknown Unknown
Transistor count 104 bn.
Transistor density Unknown Unknown Unknown
Package socket SXM6
Products B200 SXM 192GB B100 Unknown

Consumer

Die GB10 GB202 GB203 GB205 GB206 GB207
Variant(s) GB202-300-A1 GB203-200-A1
GB203-300-A1
GB203-400-A1
GB205-300-A1 GB206-250-A1
GB206-300-A1
GB207-300-A1
Release date Oct 15, 2025 Jan 30, 2025 Jan 30, 2025 Mar 4, 2025 Apr 16, 2025 Jun 24, 2025
Cores CUDA Cores 6,144 24,576 10,752 6,400 4,608 2,560
TMUs 384 768 336 200 144 80
ROPs 48 192 112 80 48 32
RT Cores 48 192 84 50 36 20
Tensor Cores 384 768 336 200 144 80
SMs 48 192 84 50 36 20
GPCs 12 7 5 3 2
Cache L1 128KB (per SM) 24 MB 10.5 MB 6.25 MB 4.5 MB 2.5 MB
L2 50 MB 128 MB 64 MB 48 MB 32 MB 32 MB
Memory interface 256-bit 512-bit 256-bit 192-bit 128-bit 128-bit
Die size Unknown 750 mm² 378 mm² 263 mm² 181 mm² 149 mm²
Transistor count Unknown 92.2 bn. 45.6 bn. 31.1 bn. 21.9 bn. 16.9 bn.
Transistor density Unknown 122.6 MTr/mm² 120.6 MTr/mm² 118.3 MTr/mm² 121.0 MTr/mm² 113.4 MTr/mm²
Products
Consumer Desktop N/A RTX 5090
RTX 5090 D
RTX 5070 Ti
RTX 5080   
RTX 5070 RTX 5060
RTX 5060 Ti
RTX 5050
Mobile N/A N/A RTX 5090 Laptop
RTX 5080 Laptop
RTX 5070 Ti Laptop RTX 5060 Laptop
RTX 5070 Laptop
RTX 5050 Laptop
Workstation Desktop DGX Spark GB20B RTX PRO 5000
RTX PRO 6000
RTX PRO 4000
RTX PRO 4500
N/a N/a N/a
Mobile N/A N/A N/A RTX PRO 3000 Mobile RTX PRO 2000 Mobile RTX PRO 500 Mobile
RTX PRO 1000 Mobile
Server RTX PRO 6000 Server Edition RTX PRO 4500 Server Edition N/a N/a N/a
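Several rows of the consumer table are derived quantities and can be recomputed from the others. The following Python sketch, with illustrative names, derives transistor density, CUDA cores per SM, and L1 cache per SM for the GB20x dies from the listed values, assuming 1 MB = 1024 KB. The results track the table to within rounding; for GB202 the quotient of the listed transistor count and die size comes out near 122.9 MTr/mm² rather than the listed 122.6, which may reflect rounding of the inputs.

```python
# (die, transistors in billions, die area in mm^2, CUDA cores, SMs, total L1 in MB)
CONSUMER_DIES = [
    ("GB202", 92.2, 750, 24_576, 192, 24.0),
    ("GB203", 45.6, 378, 10_752, 84, 10.5),
    ("GB205", 31.1, 263, 6_400, 50, 6.25),
    ("GB206", 21.9, 181, 4_608, 36, 4.5),
    ("GB207", 16.9, 149, 2_560, 20, 2.5),
]

def density_mtr_per_mm2(transistors_bn, area_mm2):
    """Transistor density in millions of transistors per square millimetre."""
    return transistors_bn * 1000.0 / area_mm2

def derive(row):
    """Return the derived per-die figures as a dictionary."""
    name, transistors_bn, area_mm2, cuda_cores, sms, l1_mb = row
    return {
        "die": name,
        "density": density_mtr_per_mm2(transistors_bn, area_mm2),
        "cores_per_sm": cuda_cores / sms,
        "l1_kb_per_sm": l1_mb * 1024 / sms,
    }

derived = [derive(row) for row in CONSUMER_DIES]

if __name__ == "__main__":
    for d in derived:
        print(f"{d['die']}: {d['density']:.1f} MTr/mm^2, "
              f"{d['cores_per_sm']:.0f} CUDA cores/SM, "
              f"{d['l1_kb_per_sm']:.0f} KB L1/SM")
```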

