SGLang


SGLang (short for Structured Generation Language) is an open-source framework for programming and serving large language models and multimodal models. It was introduced by researchers affiliated with LMSYS[1] and other institutions as a system combining a Python-embedded language for structured generation with a runtime for high-throughput inference.[2][3][4]

SGLang
Developer: LMSYS
Initial release: January 17, 2024
Written in: Python, Rust, CUDA, C++
Type: Large language model inference engine
License: Apache License 2.0
Website: sglang.io
Repository: github.com/sgl-project/sglang

The project is designed for low-latency, high-throughput inference workloads, and its documentation describes support for features such as structured outputs, speculative decoding, continuous batching, quantization, and compatibility with OpenAI-style APIs.[5]
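
Because the server exposes an OpenAI-style API, existing client code can typically be pointed at an SGLang endpoint with few changes. The following is a minimal sketch using the official openai Python client; the port (30000), the placeholder model name, and the dummy API key are illustrative assumptions rather than values prescribed by the project.

    # Query a locally running SGLang server through its
    # OpenAI-compatible endpoint. Assumes a server was started
    # beforehand (e.g. via SGLang's server launcher); the port and
    # model name below are examples only.
    from openai import OpenAI

    client = OpenAI(
        base_url="http://127.0.0.1:30000/v1",  # assumed local endpoint
        api_key="EMPTY",                       # no real key needed locally
    )

    response = client.chat.completions.create(
        model="default",  # placeholder; use the served model's name
        messages=[{"role": "user", "content": "Say hello in one sentence."}],
        max_tokens=32,
    )
    print(response.choices[0].message.content)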

History

SGLang was publicly introduced in January 2024 by researchers affiliated with Stanford, UC Berkeley, Texas A&M, and Shanghai Jiao Tong University.[2] The accompanying paper was later published in the proceedings of NeurIPS 2024.[3] In January 2026, TechCrunch reported that contributors associated with the project had formed the startup RadixArk to commercialize services around SGLang while continuing its open-source development.[6][7]

Architecture

According to the NeurIPS paper, SGLang consists of two main components: a front-end language embedded in Python and a back-end runtime for executing language model programs efficiently.[3] The front end provides primitives for generation, selection, and parallel control flow, while the runtime uses a set of optimizations intended to reduce repeated computation and improve throughput.[3]
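
A short sketch of what a front-end program can look like, built from the generation (gen), selection (select), and parallelism (fork) primitives named in the paper; the exact function names, arguments, and the endpoint URL below are illustrative and may differ between versions.

    import sglang as sgl

    # Attach the front end to a running SGLang server; the URL is an
    # assumed local default, not part of the paper.
    sgl.set_default_backend(sgl.RuntimeEndpoint("http://127.0.0.1:30000"))

    @sgl.function
    def judge_essay(s, essay):
        s += "Essay: " + essay + "\n"
        # Selection: constrained choice among a fixed set of options.
        s += "Verdict: " + sgl.select("verdict", choices=["pass", "fail"]) + "\n"
        # Parallel control flow: fork the state into two branches that
        # share the prefix above and can run concurrently.
        forks = s.fork(2)
        for f, aspect in zip(forks, ["clarity", "argument"]):
            # Generation: free-form continuation with a token budget.
            f += "Comment on " + aspect + ": " + sgl.gen("comment", max_tokens=48)
        s += "Clarity: " + forks[0]["comment"] + "\n"
        s += "Argument: " + forks[1]["comment"] + "\n"

    state = judge_essay.run(essay="Example essay text.")
    print(state["verdict"])

Because both forked branches extend the same prompt prefix, a runtime that caches key–value state can serve them without recomputing the shared portion, which is the kind of repeated computation the back end is designed to avoid.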

Among the techniques described by the project are RadixAttention for reusing key–value cache state across multiple generation calls, compressed finite-state machines for faster constrained decoding, and speculative execution for API-based models.[3] The current documentation also describes support for serving both language models and multimodal models across a range of hardware back ends.[5]
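
The prefix reuse behind RadixAttention can be pictured as a tree keyed by token IDs, in which nodes stand for cached key–value state and a new request first walks the tree to find its longest already-cached prefix. The sketch below is a conceptual toy (an uncompressed trie rather than a true radix tree) and is not SGLang's implementation, which manages actual GPU key–value tensors along with an eviction policy.

    # Toy model of prefix reuse: a trie over token IDs where each node
    # stands in for cached KV state. Illustrative only; not SGLang code.
    class Node:
        def __init__(self):
            self.children = {}  # token id -> Node

    class PrefixCache:
        def __init__(self):
            self.root = Node()

        def insert(self, tokens):
            """Record that KV state now exists for every prefix of tokens."""
            node = self.root
            for t in tokens:
                node = node.children.setdefault(t, Node())

        def match_prefix(self, tokens):
            """Length of the longest already-cached prefix of tokens."""
            node, length = self.root, 0
            for t in tokens:
                if t not in node.children:
                    break
                node, length = node.children[t], length + 1
            return length

    cache = PrefixCache()
    cache.insert([1, 2, 3, 4])               # first request populates the tree
    print(cache.match_prefix([1, 2, 3, 9]))  # prints 3: three tokens reusable

In the real system, the matched prefix lets the runtime skip recomputing attention keys and values for those tokens across multiple generation calls.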

References
