SGLang
Open-source framework for large language model inference
SGLang (short for Structured Generation Language) is an open-source framework for programming and serving large language models and multimodal models. It was introduced by researchers affiliated with LMSYS[1] and other institutions as a system combining a Python-embedded language for structured generation with a runtime for high-throughput inference.[2][3][4]
| SGLang | |
|---|---|
| Developer | LMSYS |
| Initial release | January 17, 2024 |
| Written in | Python, Rust, CUDA, C++ |
| Type | Large language model inference engine |
| License | Apache License 2.0 |
| Website | sglang |
| Repository | github |
The project is designed for low-latency, high-throughput inference workloads. Its documentation describes support for structured outputs, speculative decoding, continuous batching, quantization, and OpenAI-compatible APIs.[5]
History
SGLang was publicly introduced in January 2024 by researchers affiliated with Stanford, UC Berkeley, Texas A&M, and Shanghai Jiao Tong University.[2] Its academic description later appeared in the proceedings of NeurIPS 2024.[3] In January 2026, TechCrunch reported that contributors associated with the project had formed the startup RadixArk to commercialize services around SGLang while continuing its open-source development.[6][7]
Architecture
According to the NeurIPS paper, SGLang consists of two main components: a front-end language embedded in Python and a back-end runtime for executing language model programs efficiently.[3] The front end provides primitives for generation, selection, and parallel control flow, while the runtime uses a set of optimizations intended to reduce repeated computation and improve throughput.[3]
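As an illustration of how generation, selection, and parallel control flow can compose in a Python-embedded language, the following is a toy sketch and not SGLang's actual API. The `ProgramState` class, its method signatures, and the stub back end and scoring function are all hypothetical stand-ins chosen for this example; a real front end would dispatch these calls to a model runtime.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical stand-in for a model back end: maps a prompt to a completion.
Backend = Callable[[str], str]


@dataclass
class ProgramState:
    """Toy prompt state; NOT SGLang's real interface."""
    backend: Backend
    text: str = ""
    variables: Dict[str, str] = field(default_factory=dict)

    def append(self, fragment: str) -> "ProgramState":
        # Extend the prompt with literal text.
        self.text += fragment
        return self

    def gen(self, name: str) -> "ProgramState":
        # Generation: ask the back end to continue the prompt, store the result.
        output = self.backend(self.text)
        self.variables[name] = output
        self.text += output
        return self

    def select(self, name: str, choices: List[str],
               score: Callable[[str, str], float]) -> "ProgramState":
        # Selection: keep the candidate with the highest score for this prompt.
        best = max(choices, key=lambda choice: score(self.text, choice))
        self.variables[name] = best
        self.text += best
        return self

    def fork(self, n: int) -> List["ProgramState"]:
        # Parallel control flow: n independent branches sharing the same prefix.
        return [ProgramState(self.backend, self.text, dict(self.variables))
                for _ in range(n)]


def stub_backend(prompt: str) -> str:
    # Assumption: a fixed reply replaces a real model call.
    return "[reply]"


def shorter_is_better(prompt: str, choice: str) -> float:
    # Assumption: a trivial score replaces model log-probabilities.
    return -len(choice)


state = ProgramState(stub_backend).append("Is the sky blue? ")
state.select("answer", ["yes", "no"], shorter_is_better)
branches = state.fork(2)
branches[0].append(" Explain: ").gen("explanation")
branches[1].append(" Counterexample: ").gen("counterexample")
```

Each branch starts from the same prompt prefix, which is the kind of shared state a runtime can exploit to avoid recomputing work.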
Among the techniques described by the project are RadixAttention for reusing key–value cache state across multiple generation calls, compressed finite-state machines for faster constrained decoding, and speculative execution for API-based models.[3] The current documentation also describes support for serving both language models and multimodal models across a range of hardware back ends.[5]
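The idea of reusing cached state across calls that share a prompt prefix can be sketched with a small prefix tree over token IDs. This is a simplified illustration under stated assumptions, not the project's implementation: it uses a plain trie with one token per edge rather than a radix tree, stores no actual key–value tensors, and omits eviction. The names `PrefixCache`, `match_prefix`, and `tokens_to_compute` are hypothetical.

```python
from typing import Dict, List


class PrefixNode:
    """One node per token; a real cache would also hold KV state here."""

    def __init__(self) -> None:
        self.children: Dict[int, "PrefixNode"] = {}


class PrefixCache:
    """Toy prefix cache over token IDs (plain trie, no eviction)."""

    def __init__(self) -> None:
        self.root = PrefixNode()

    def match_prefix(self, tokens: List[int]) -> int:
        # Return how many leading tokens are already cached.
        node, matched = self.root, 0
        for token in tokens:
            child = node.children.get(token)
            if child is None:
                break
            node, matched = child, matched + 1
        return matched

    def insert(self, tokens: List[int]) -> None:
        # Record the full token sequence so later requests can reuse it.
        node = self.root
        for token in tokens:
            node = node.children.setdefault(token, PrefixNode())


def tokens_to_compute(cache: PrefixCache, tokens: List[int]) -> int:
    # Only the suffix beyond the cached prefix needs fresh computation.
    reused = cache.match_prefix(tokens)
    cache.insert(tokens)
    return len(tokens) - reused


cache = PrefixCache()
system_prompt = [101, 102, 103]            # hypothetical shared prefix
first = tokens_to_compute(cache, system_prompt + [7, 8])
second = tokens_to_compute(cache, system_prompt + [9])
```

In this sketch the first request computes all five tokens, while the second reuses the three-token shared prefix and computes only its one new token.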