PagedAttention
Attention algorithm for efficient large language model serving
PagedAttention is an attention algorithm for efficient serving of large language models (LLMs). It was introduced in 2023 by Woosuk Kwon and colleagues in the paper Efficient Memory Management for Large Language Model Serving with PagedAttention,[1] alongside the vLLM serving engine. The method stores the key–value cache used during autoregressive decoding in fixed-size blocks that can be mapped to non-contiguous physical memory, borrowing ideas from virtual memory, paging, and operating system design.[2][3]
Background
In Transformer inference, the key–value cache grows with sequence length and the number of concurrent requests. Kwon et al. argued that earlier serving systems typically reserved contiguous cache regions in advance, which wasted memory through slots reserved for future tokens, internal fragmentation, and external fragmentation. In their experiments, the paper reported that the effective memory utilization of previous systems could fall as low as 20.4%.[2]
Description
PagedAttention partitions the cache of each sequence into fixed-size KV blocks. A request's cache is represented as a sequence of logical blocks, while a block table maps those logical blocks to physical GPU-memory blocks. As a result, neighboring logical blocks do not need to be contiguous in physical memory, and new blocks can be allocated on demand as generation proceeds.[2]
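The block-table mechanism can be sketched as follows. This is an illustrative model, not vLLM's actual code; the block size, class names, and allocator are hypothetical:

```python
# Sketch of a block table mapping a sequence's logical KV blocks to
# physical blocks allocated on demand (names and values are illustrative).
BLOCK_SIZE = 16  # tokens per KV block (hypothetical value)

class BlockAllocator:
    """Hands out physical block numbers from a free pool."""
    def __init__(self, num_physical_blocks):
        self.free = list(range(num_physical_blocks))

    def allocate(self):
        return self.free.pop()  # any free block; need not be contiguous

class Sequence:
    def __init__(self, allocator):
        self.allocator = allocator
        self.block_table = []   # logical block index -> physical block number
        self.num_tokens = 0

    def append_token(self):
        # Allocate a new physical block only when the last one is full,
        # so memory grows on demand as generation proceeds.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(self.allocator.allocate())
        self.num_tokens += 1

alloc = BlockAllocator(num_physical_blocks=64)
seq = Sequence(alloc)
for _ in range(40):             # generate 40 tokens
    seq.append_token()
print(len(seq.block_table))     # 3 blocks: ceil(40 / 16)
```

Because the attention kernel looks up physical addresses through the block table, neighboring logical blocks may map to arbitrary, non-adjacent physical blocks.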
The design also makes it easier to share cache state across related decoding paths. In vLLM, physical blocks can be reference-counted and shared among requests or branches, with block-granularity copy-on-write used when a shared block must be modified. The original paper applied this design to parallel sampling, beam search, and prompts with shared prefixes.[2]
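The sharing scheme can be illustrated with reference counts on physical blocks; again a minimal sketch, not vLLM's API:

```python
# Sketch of reference-counted physical blocks with block-granularity
# copy-on-write (hypothetical names, not vLLM's implementation).
class RefCountedAllocator:
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))
        self.refcount = {}

    def allocate(self):
        b = self.free.pop()
        self.refcount[b] = 1
        return b

    def share(self, b):
        # e.g. two beam-search branches reference the same prefix block
        self.refcount[b] += 1

    def copy_on_write(self, b):
        # If the block is shared, give the writer a private copy;
        # an unshared block may be written in place.
        if self.refcount[b] > 1:
            self.refcount[b] -= 1
            return self.allocate()
        return b

alloc = RefCountedAllocator(8)
prefix = alloc.allocate()
alloc.share(prefix)                    # second branch now shares the block
private = alloc.copy_on_write(prefix)  # writer receives its own copy
assert private != prefix
assert alloc.copy_on_write(private) == private  # unshared: write in place
```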
Mathematical formulation
For a query token in causal self-attention, the standard attention output can be written as

$a_{ij} = \frac{\exp(q_i^\top k_j / \sqrt{d})}{\sum_{t=1}^{i} \exp(q_i^\top k_t / \sqrt{d})}, \qquad o_i = \sum_{j=1}^{i} a_{ij} v_j,$

where $q_i$, $k_j$, and $v_j$ are the query, key, and value vectors, and $d$ is the attention dimension.[2]
If the cache is partitioned into blocks of size $B$, the key and value blocks may be written as

$K_j = (k_{(j-1)B+1}, \ldots, k_{jB}), \qquad V_j = (v_{(j-1)B+1}, \ldots, v_{jB}).$

PagedAttention then performs the computation blockwise:

$o_i = \sum_{j=1}^{\lceil i/B \rceil} V_j A_{ij}^\top,$

where $A_{ij} = (a_{i,(j-1)B+1}, \ldots, a_{i,jB})$ is the vector of attention scores for the $j$-th KV block. In the formulation given by Kwon et al., this preserves the causal attention calculation while allowing the key and value blocks to reside in non-contiguous physical memory.[2]
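The blockwise formulation can be checked numerically. The sketch below, with illustrative shapes and names rather than the paper's kernel, computes attention for one query token over KV blocks and compares it against the standard computation:

```python
# Numerical check: blockwise attention over KV blocks of size B matches
# standard causal attention for one query token (illustrative shapes).
import numpy as np

d, B, n = 8, 4, 10          # head dim, block size, tokens seen so far
rng = np.random.default_rng(0)
q = rng.standard_normal(d)
K = rng.standard_normal((n, d))
V = rng.standard_normal((n, d))

# Standard attention: the latest query token attends to all n tokens.
scores = K @ q / np.sqrt(d)
a = np.exp(scores) / np.exp(scores).sum()
o_ref = a @ V

# Blockwise: iterate over KV blocks, which could live anywhere in memory;
# accumulate the softmax numerator and denominator across blocks.
num = np.zeros(d)
den = 0.0
for j in range(0, n, B):
    Kj, Vj = K[j:j+B], V[j:j+B]        # one KV block
    e = np.exp(Kj @ q / np.sqrt(d))    # unnormalized scores for this block
    num += e @ Vj
    den += e.sum()
o_blk = num / den

assert np.allclose(o_ref, o_blk)       # identical result
```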
Performance and use
The vLLM paper reported that, on its evaluated workloads, the use of PagedAttention and the associated memory-management design improved serving throughput by 2–4× over the compared baselines, including FasterTransformer and Orca, while preserving model outputs. In experiments on OPT-13B with the Alpaca trace, the paper also reported memory savings of 6.1–9.8% for parallel sampling and 37.6–55.2% for beam search through KV-block sharing.[2]
A 2024 survey of LLM serving systems described PagedAttention as having become an industry norm in LLM serving frameworks, citing support in TGI, vLLM, and TensorRT-LLM.[3]
Limitations and alternatives
Subsequent work has described trade-offs in the approach. The 2025 vAttention paper argued that PagedAttention requires attention kernels to be rewritten to support paging, adding software complexity, portability problems, redundancy, and execution overhead; it proposed instead a memory manager that keeps the cache contiguous in virtual memory while relying on demand paging for physical allocation.[4]
vAttention
Unlike PagedAttention, vAttention does not introduce a different attention rule; it retains the standard attention computation. In the notation of Prabhu et al., the key and value tensors for a request seen so far are $K, V \in \mathbb{R}^{N \times H \times D}$, where $N$ is the context length seen so far, $H$ is the number of KV heads on a worker, and $D$ is the dimension of each KV head.[4]
In systems prior to PagedAttention, the K cache (or V cache) at each layer of a worker is typically allocated as a 4D tensor of shape $[B, N_{\max}, H, D]$, where $B$ is batch size and $N_{\max}$ is the maximum context length supported by the model.[4]
vAttention preserves this contiguous virtual-memory view while deferring physical-memory allocation to runtime. A serving framework maintains separate K and V tensors for each layer, so vAttention reserves $2N'$ virtual-memory buffers on a worker, where $N'$ is the number of layers managed by that worker.[4]
The maximum size of one virtual-memory buffer is $B \times S$, where $S$ is the maximum size of a single request's per-layer K cache (or V cache) on a worker. The paper defines $S = N_{\max} \times H \times D \times P$, where $P$ is the number of bytes needed to store one element.[4]
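A worked instance of this arithmetic, using assumed configuration numbers (a Llama-style 13B-class setup with FP16 elements, not values taken from the paper):

```python
# Buffer-size arithmetic with assumed (illustrative) configuration values.
N_max = 4096    # maximum context length supported by the model
H = 40          # KV heads on a worker
D = 128         # dimension of each KV head
P = 2           # bytes per element (FP16)
B = 32          # batch size

S = N_max * H * D * P      # max per-request, per-layer K (or V) cache
buffer_bytes = B * S       # virtual-memory reservation for one buffer

print(S / 2**20, "MiB per request per layer")   # 40.0 MiB
print(buffer_bytes / 2**30, "GiB per buffer")   # 1.25 GiB
```

Only a small fraction of each reservation is typically backed by physical pages at any time; the rest is virtual address space, mapped on demand.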
In this formulation, vAttention keeps the KV cache contiguous in virtual memory and relies on demand paging for physical allocation, rather than modifying the attention kernel to operate over non-contiguous KV-cache blocks.[4]