vLLM
Open-source software for large language model inference
From Wikipedia, the free encyclopedia
vLLM is an open-source software framework for inference and serving of large language models and related multimodal models. Originally developed at the University of California, Berkeley's Sky Computing Lab, the project is centered on PagedAttention, a memory-management method for transformer key–value caches, and supports features such as continuous batching, distributed inference, quantization, and OpenAI-compatible APIs.[1][2][3] According to a project maintainer, the "v" in vLLM originally referred to "virtual", inspired by virtual memory.[4]
| vLLM | |
|---|---|
| Original authors | Sky Computing Lab, University of California, Berkeley |
| Developer | vLLM contributors |
| Initial release | 2023 |
| Written in | Python, CUDA, C++ |
| Type | Large language model inference engine |
| License | Apache License 2.0 |
| Website | vllm |
| Repository | github |
History
vLLM was introduced in 2023 by researchers affiliated with the Sky Computing Lab at UC Berkeley.[1][2] Its core ideas were described in the 2023 paper Efficient Memory Management for Large Language Model Serving with PagedAttention,[5] which presented the system as a high-throughput, memory-efficient serving engine for large language models.[2]
In 2025, the PyTorch Foundation announced that vLLM had become a Foundation-hosted project. PyTorch's project page states that the University of California, Berkeley contributed vLLM to the Linux Foundation in July 2024.[6][3]
In January 2026, TechCrunch reported that the creators of vLLM had launched the startup Inferact to commercialize the project, raising $150 million in seed funding.[7]
Architecture
According to its 2023 paper, vLLM was designed to improve the efficiency of large language model serving by reducing memory waste in the key–value cache used during transformer inference.[2] The paper introduced PagedAttention, an algorithm inspired by virtual memory and paging techniques in operating systems, and described vLLM as using block-level memory management and request scheduling to increase throughput while maintaining similar latency.[2]
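The paging idea can be illustrated with a toy block allocator. The sketch below is illustrative only, not vLLM's actual implementation: the class names, the block size, and the pool size are assumptions. It shows the core mechanism the paper describes, mapping a sequence's logical token slots to fixed-size physical cache blocks allocated on demand, so that at most one partially filled block per sequence is ever wasted, rather than a large contiguous preallocation.

```python
# Toy sketch of PagedAttention-style KV-cache paging.
# Illustrative only; names and sizes are assumptions, not vLLM internals.

BLOCK_SIZE = 4  # tokens per KV-cache block (the real default differs)

class BlockAllocator:
    """Hands out fixed-size cache blocks from a shared free pool."""
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))

    def alloc(self):
        return self.free.pop()

    def release(self, blocks):
        # Blocks freed by a finished request become available to others.
        self.free.extend(blocks)

class Sequence:
    """Tracks one request's block table: logical slots -> physical blocks."""
    def __init__(self, allocator):
        self.allocator = allocator
        self.block_table = []  # physical block ids, in logical order
        self.num_tokens = 0

    def append_token(self):
        # Allocate a new block only when the last one is full, so at most
        # BLOCK_SIZE - 1 slots per sequence are ever wasted.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(self.allocator.alloc())
        self.num_tokens += 1

allocator = BlockAllocator(num_blocks=16)
seq = Sequence(allocator)
for _ in range(10):  # generate 10 tokens
    seq.append_token()

print(len(seq.block_table))  # 10 tokens fill ceil(10/4) = 3 blocks
print(len(allocator.free))   # 13 blocks remain for other requests
```

Under contiguous preallocation, the same request might have reserved cache for the model's full context length up front; block-level allocation is what lets the unused capacity serve concurrent requests instead.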
The project documentation and repository describe support for continuous batching, chunked prefill, speculative decoding, prefix caching, quantization, and multiple forms of distributed inference and serving.[1][3] PyTorch has described vLLM as a high-throughput, memory-efficient inference and serving engine that supports a range of hardware back ends, including NVIDIA and AMD GPUs, Google TPUs, AWS Trainium, and Intel processors.[6][3]
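Continuous batching, one of the features listed above, can be sketched with a small scheduler simulation. This is a minimal toy, not vLLM's scheduler: the request tuples, the batch limit, and the step loop are assumptions made for illustration. The point it demonstrates is iteration-level scheduling, where new requests join the running batch as soon as a slot frees up, rather than waiting for the whole batch to drain.

```python
# Toy sketch of continuous (iteration-level) batching.
# Illustrative only; this is not vLLM's actual scheduler.
from collections import deque

MAX_BATCH = 2  # max sequences decoded per step (assumed for the sketch)

def run(requests):
    """requests: list of (name, tokens_to_generate). Returns finish order."""
    waiting = deque(requests)
    running = []   # mutable [name, tokens_left] entries
    finished = []
    while waiting or running:
        # Admit waiting requests the moment slots free up, mid-flight.
        while waiting and len(running) < MAX_BATCH:
            running.append(list(waiting.popleft()))
        # One decoding step: every running sequence emits one token.
        for seq in running:
            seq[1] -= 1
        still_running = []
        for seq in running:
            (finished if seq[1] == 0 else still_running).append(seq)
        running = still_running
    return [name for name, _ in finished]

print(run([("a", 1), ("b", 3), ("c", 1)]))  # -> ['a', 'c', 'b']
```

Note that "c" finishes before "b": it was admitted into the slot "a" vacated while "b" was still decoding. With naive request-level batching, "c" would instead have waited for the entire first batch to complete.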