vLLM
Open-source software for large language model inference
From Wikipedia, the free encyclopedia
vLLM is an open-source software framework for inference and serving of large language models and related multimodal models. Originally developed at the University of California, Berkeley's Sky Computing Lab, the project is centered on PagedAttention, a memory-management method for transformer key–value caches, and supports features such as continuous batching, distributed inference, quantization, and OpenAI-compatible APIs.[1][2][3] According to a project maintainer, the "v" in vLLM originally referred to "virtual", inspired by virtual memory.[4]
| vLLM | |
|---|---|
| Original authors | Sky Computing Lab, University of California, Berkeley |
| Developer | vLLM contributors |
| Initial release | 2023 |
| Written in | Python, CUDA, C++ |
| Type | Large language model inference engine |
| License | Apache License 2.0 |
| Website | vllm |
| Repository | github |
History
vLLM was introduced in 2023 by researchers affiliated with the Sky Computing Lab at UC Berkeley.[1][2] Its core ideas were described in the 2023 paper Efficient Memory Management for Large Language Model Serving with PagedAttention,[5] which presented the system as a high-throughput, memory-efficient serving engine for large language models.[2]
In 2025, the PyTorch Foundation announced that vLLM had become a Foundation-hosted project. PyTorch's project page states that the University of California, Berkeley contributed vLLM to the Linux Foundation in July 2024.[6][3]
In January 2026, TechCrunch reported that the creators of vLLM had launched the startup Inferact to commercialize the project, raising $150 million in seed funding.[7]
Architecture
According to its 2023 paper, vLLM was designed to improve the efficiency of large language model serving by reducing memory waste in the key–value cache used during transformer inference.[2] The paper introduced PagedAttention, an algorithm inspired by virtual memory and paging techniques in operating systems, and described vLLM as using block-level memory management and request scheduling to increase throughput while maintaining similar latency.[2]
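The paging idea can be illustrated with a toy block allocator. The sketch below is illustrative only, not vLLM's actual implementation: the class names, the block size, and the pool size are assumptions. It shows the core mechanism the paper describes, mapping a sequence's logical token slots to fixed-size physical cache blocks allocated on demand, so that at most one partially filled block per sequence is ever wasted, rather than a large contiguous preallocation.

```python
# Toy sketch of PagedAttention-style KV-cache paging.
# Illustrative only; names and sizes are assumptions, not vLLM internals.

BLOCK_SIZE = 4  # tokens per KV-cache block (the real default differs)

class BlockAllocator:
    """Hands out fixed-size cache blocks from a shared free pool."""
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))

    def alloc(self):
        return self.free.pop()

    def release(self, blocks):
        # Blocks freed by a finished request become available to others.
        self.free.extend(blocks)

class Sequence:
    """Tracks one request's block table: logical slots -> physical blocks."""
    def __init__(self, allocator):
        self.allocator = allocator
        self.block_table = []  # physical block ids, in logical order
        self.num_tokens = 0

    def append_token(self):
        # Allocate a new block only when the last one is full, so at most
        # BLOCK_SIZE - 1 slots per sequence are ever wasted.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(self.allocator.alloc())
        self.num_tokens += 1

allocator = BlockAllocator(num_blocks=16)
seq = Sequence(allocator)
for _ in range(10):  # generate 10 tokens
    seq.append_token()

print(len(seq.block_table))  # 10 tokens fill ceil(10/4) = 3 blocks
print(len(allocator.free))   # 13 blocks remain for other requests
```

Under contiguous preallocation, the same request might have reserved cache for the model's full context length up front; block-level allocation is what lets the unused capacity serve concurrent requests instead.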
The project documentation and repository describe support for continuous batching, chunked prefill, speculative decoding, prefix caching, quantization, and multiple forms of distributed inference and serving.[1][3] PyTorch has described vLLM as a high-throughput, memory-efficient inference and serving engine that supports a range of hardware back ends, including NVIDIA and AMD GPUs, Google TPUs, AWS Trainium, and Intel processors.[6][3]
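Continuous batching, one of the features listed above, can be sketched with a small scheduler simulation. This is a minimal toy, not vLLM's scheduler: the request tuples, the batch limit, and the step loop are assumptions made for illustration. The point it demonstrates is iteration-level scheduling, where new requests join the running batch as soon as a slot frees up, rather than waiting for the whole batch to drain.

```python
# Toy sketch of continuous (iteration-level) batching.
# Illustrative only; this is not vLLM's actual scheduler.
from collections import deque

MAX_BATCH = 2  # max sequences decoded per step (assumed for the sketch)

def run(requests):
    """requests: list of (name, tokens_to_generate). Returns finish order."""
    waiting = deque(requests)
    running = []   # mutable [name, tokens_left] entries
    finished = []
    while waiting or running:
        # Admit waiting requests the moment slots free up, mid-flight.
        while waiting and len(running) < MAX_BATCH:
            running.append(list(waiting.popleft()))
        # One decoding step: every running sequence emits one token.
        for seq in running:
            seq[1] -= 1
        still_running = []
        for seq in running:
            (finished if seq[1] == 0 else still_running).append(seq)
        running = still_running
    return [name for name, _ in finished]

print(run([("a", 1), ("b", 3), ("c", 1)]))  # -> ['a', 'c', 'b']
```

Note that "c" finishes before "b": it was admitted into the slot "a" vacated while "b" was still decoding. With naive request-level batching, "c" would instead have waited for the entire first batch to complete.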