vLLM
Open-source software for large language model inference
vLLM is an open-source software framework for inference and serving of large language models and related multimodal models. Originally developed at the University of California, Berkeley's Sky Computing Lab,[1] the project is centered on PagedAttention, a memory-management method for transformer key–value caches, and supports features such as continuous batching, distributed inference, quantization, and OpenAI-compatible APIs.[2][3][4] According to a project maintainer, the "v" in vLLM originally referred to "virtual", inspired by virtual memory.[5]
| vLLM | |
|---|---|
| Original authors | Sky Computing Lab, UC Berkeley |
| Developer | vLLM contributors |
| Initial release | 2023 |
| Written in | Python, CUDA, C++ |
| Type | Large language model inference engine |
| License | Apache License 2.0 |
| Website | vllm |
| Repository | github |
History
vLLM was introduced in 2023 by researchers affiliated with the Sky Computing Lab at UC Berkeley.[3][2] Its core ideas were described in the 2023 paper "Efficient Memory Management for Large Language Model Serving with PagedAttention",[6] which presented the system as a high-throughput and memory-efficient serving engine for large language models.[3]
In 2025, the PyTorch Foundation announced that vLLM had become a Foundation-hosted project. PyTorch's project page states that the University of California, Berkeley had contributed vLLM to the Linux Foundation in July 2024.[7][4]
In January 2026, TechCrunch reported that the creators of vLLM had launched the startup Inferact to commercialize the project, raising $150 million in seed funding.[8]
Architecture
According to its 2023 paper, vLLM was designed to make large language model serving more efficient by reducing memory waste in the key–value cache used during transformer inference.[3] The paper introduced PagedAttention, an algorithm inspired by virtual memory and paging techniques in operating systems, and described vLLM as using block-level memory management and request scheduling to increase throughput at a latency similar to that of earlier serving systems.[3]
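The block-level idea can be illustrated with a minimal Python sketch. It is not vLLM's implementation: the class name `PagedKVCache`, its methods, and the block size of 16 tokens are assumptions made for illustration. Each sequence keeps a block table that maps its logical token positions to fixed-size physical blocks, so memory is claimed one block at a time rather than reserved up front for the maximum sequence length.

```python
from dataclasses import dataclass, field


@dataclass
class PagedKVCache:
    """Toy block-level KV-cache allocator (illustrative only, not vLLM's code)."""
    num_blocks: int          # physical blocks available in the pool
    block_size: int = 16     # tokens stored per block (assumed value)
    free_blocks: list = field(default_factory=list)
    block_tables: dict = field(default_factory=dict)  # seq id -> physical block ids
    seq_lens: dict = field(default_factory=dict)      # seq id -> tokens stored

    def __post_init__(self):
        self.free_blocks = list(range(self.num_blocks))

    def append_token(self, seq_id):
        """Reserve a slot for one new token; return (physical block, offset)."""
        length = self.seq_lens.get(seq_id, 0)
        if length % self.block_size == 0:
            # The last block is full (or the sequence is new): map a new block.
            if not self.free_blocks:
                raise MemoryError("no free KV-cache blocks")
            self.block_tables.setdefault(seq_id, []).append(self.free_blocks.pop())
        self.seq_lens[seq_id] = length + 1
        table = self.block_tables[seq_id]
        return table[length // self.block_size], length % self.block_size

    def free(self, seq_id):
        """Return all blocks of a finished sequence to the pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)
```

In this sketch, a sequence wastes at most one partially filled block, and blocks released by a finished sequence become available to other sequences immediately.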
The project documentation and repository describe support for continuous batching, chunked prefill, speculative decoding, prefix caching, quantization, and multiple forms of distributed inference and serving.[2][4] PyTorch has described vLLM as a high-throughput, memory-efficient inference and serving engine that supports a range of hardware back ends, including NVIDIA and AMD GPUs, Google TPUs, AWS Trainium, and Intel processors.[7][4]
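Continuous batching, one of the features listed above, can be sketched as a small simulation. This is a simplified illustration, not vLLM's scheduler: the function name `continuous_batching` and the workload format are hypothetical, and each request is assumed to emit exactly one token per decoding step. The point is that waiting requests are admitted as soon as a slot frees up, rather than after the whole batch has finished.

```python
from collections import deque


def continuous_batching(requests, max_batch_size):
    """Simulate iteration-level scheduling.

    requests: list of (request_id, tokens_to_generate) pairs (hypothetical workload).
    Returns the batch composition at each decoding step.
    """
    waiting = deque(requests)
    running = {}   # request id -> tokens still to generate
    steps = []
    while waiting or running:
        # Admit new requests whenever a slot is free, not only when the batch drains.
        while waiting and len(running) < max_batch_size:
            req_id, n_tokens = waiting.popleft()
            running[req_id] = n_tokens
        steps.append(sorted(running))
        # One decoding step: every running request produces one token.
        for req_id in list(running):
            running[req_id] -= 1
            if running[req_id] <= 0:
                del running[req_id]   # finished requests leave the batch at once
    return steps


schedule = continuous_batching([("a", 1), ("b", 3), ("c", 2)], max_batch_size=2)
```

With this workload, request `c` joins the batch in the second step, right after `a` finishes, so the simulation completes in three steps. Running `a` and `b` to completion before starting `c` would take five.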