TensorRT

Nvidia software development kit for deep learning inference

TensorRT is a software development kit (SDK) and inference optimization runtime developed by Nvidia for deploying trained deep learning and machine learning models on graphics processing units (GPUs).[1][2] It can import models from frameworks such as PyTorch, TensorFlow, and ONNX, and compile them into optimized runtime engines for low-latency and high-throughput inference.[1][2]

In current Nvidia documentation, the TensorRT name is also used for a broader product family that includes the core TensorRT SDK, TensorRT-LLM, and TensorRT-RTX.[3] The core SDK is primarily a proprietary Nvidia product, although Nvidia also maintains Apache-licensed open-source TensorRT repositories and related companion projects.[4][5]

History

TensorRT was available as part of Nvidia's deep learning software stack by 2017, when it was described as a high-performance inference engine for deploying trained neural networks on Nvidia GPUs.[6] In 2018, Google announced integration of Nvidia TensorRT with TensorFlow 1.7, describing TensorRT as a library that optimizes deep learning models for inference and creates a runtime for deployment on GPUs in production environments.[7]

Overview

The core of TensorRT is a C++ library that takes a trained network, consisting of a network definition and trained parameters, and produces a highly optimized runtime engine for inference on Nvidia GPUs.[2] TensorRT provides both C++ and Python APIs, and models can either be expressed directly through its network definition API or imported through its ONNX parser.[2]
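
A minimal sketch of this workflow in Python, assuming a hypothetical ONNX file named model.onnx (exact API details vary between TensorRT versions):

    import tensorrt as trt

    logger = trt.Logger(trt.Logger.WARNING)

    # Define the network by parsing an ONNX model (hypothetical "model.onnx").
    builder = trt.Builder(logger)
    network = builder.create_network(
        1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
    )
    parser = trt.OnnxParser(network, logger)
    if not parser.parse_from_file("model.onnx"):
        raise RuntimeError("failed to parse the ONNX model")

    # Build a serialized engine ("plan") that can be saved and deployed.
    config = builder.create_builder_config()
    serialized_engine = builder.build_serialized_network(network, config)
    with open("model.plan", "wb") as f:
        f.write(serialized_engine)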

According to Nvidia's documentation, TensorRT performs graph-level and kernel-level optimizations such as layer fusion and selection of efficient implementations for supported operations.[2] Current documentation also describes support for dynamic shapes, mixed-precision execution modes including FP32, FP16, BF16, FP8, and INT8, and specialized optimizations for transformer and large language model workloads.[1]
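
As a hedged illustration, the builder configuration from the sketch above could enable reduced precision and declare a dynamic batch dimension for an input tensor assumed to be named "input" (the tensor name and shapes are illustrative):

    # Allow FP16 kernels where TensorRT finds them faster; INT8 additionally
    # requires calibration data or pre-quantized weights.
    config.set_flag(trt.BuilderFlag.FP16)

    # Dynamic shapes: give minimum, optimal, and maximum shapes for the
    # (hypothetical) input tensor "input" so one engine serves several batch sizes.
    profile = builder.create_optimization_profile()
    profile.set_shape("input", (1, 3, 224, 224), (8, 3, 224, 224), (32, 3, 224, 224))
    config.add_optimization_profile(profile)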

TensorRT engines can be generated through the TensorRT APIs or with the trtexec command-line utility.[8] Nvidia's quick-start documentation describes deployment workflows based on ONNX conversion, runtime APIs, and direct engine deserialization for C++ and Python applications.[8]
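
A short sketch of the deserialization path, assuming an engine file model.plan built ahead of time (for example with trtexec); the file names are illustrative:

    import tensorrt as trt

    # Engine built offline, e.g. with: trtexec --onnx=model.onnx --saveEngine=model.plan
    logger = trt.Logger(trt.Logger.WARNING)
    runtime = trt.Runtime(logger)
    with open("model.plan", "rb") as f:
        engine = runtime.deserialize_cuda_engine(f.read())

    # The execution context holds per-inference state; an application then binds
    # input and output device buffers and calls one of the context's execute methods.
    context = engine.create_execution_context()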

Licensing and open-source components

The licensing model around TensorRT is split between a proprietary core SDK and a set of open-source repositories and tools.[4][5] The packaged TensorRT software distributed by Nvidia is governed by the Nvidia Software License Agreement.[4] At the same time, Nvidia maintains a public TensorRT repository on GitHub under the Apache License 2.0.[5]

Official TensorRT documentation also directs users to the TensorRT open-source software repository for quick-start code and samples.[8] The architecture documentation describes related tooling such as Polygraphy for debugging and constant folding, as well as ONNX-GraphSurgeon for modifying ONNX graphs before deployment with TensorRT.[9] TensorRT also supports a plugin mechanism for custom layers and unsupported operations.[8]
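
As a hedged example of this companion tooling, the following sketch uses the open-source onnx-graphsurgeon package to fold constants and clean up an ONNX graph before it is passed to the TensorRT parser; the file names are hypothetical:

    import onnx
    import onnx_graphsurgeon as gs

    # Load the ONNX model into a GraphSurgeon graph (hypothetical "model.onnx").
    graph = gs.import_onnx(onnx.load("model.onnx"))

    # Fold constant subgraphs, remove unused nodes, and restore topological order
    # before exporting the simplified model for TensorRT.
    graph.fold_constants().cleanup().toposort()
    onnx.save(gs.export_onnx(graph), "model_folded.onnx")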

Product family

Nvidia's current documentation groups several inference products under the TensorRT name.[3] In that documentation, the core SDK is distinguished as TensorRT (Enterprise), while related offerings include TensorRT-LLM for large language model inference and TensorRT-RTX for consumer RTX GPUs.[3]

TensorRT-LLM

TensorRT-LLM is a related open-source toolkit for optimizing and serving large language models on Nvidia GPUs.[3][10] Nvidia describes it as providing a Python API to define LLMs and build TensorRT engines optimized for LLM workloads.[3][10]
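
A minimal sketch of that Python API, assuming the open-source tensorrt_llm package and its high-level LLM interface (the model identifier and sampling settings are illustrative):

    from tensorrt_llm import LLM, SamplingParams

    # Build or load a TensorRT engine for the given checkpoint and run generation.
    llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")
    params = SamplingParams(temperature=0.8, max_tokens=64)

    for output in llm.generate(["What is TensorRT?"], params):
        print(output.outputs[0].text)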

According to Nvidia's product-family documentation, TensorRT-LLM supports multi-GPU and multi-node execution, in-flight batching, paged KV caching, and quantization methods such as FP8, INT8, and INT4 for higher-throughput model serving.[3] The TensorRT-LLM codebase is published on GitHub under the Apache License 2.0.[11]

Because Nvidia documents TensorRT-LLM as a separate member of the TensorRT product family, it is typically treated as a related but distinct software project rather than as a single feature of the base TensorRT SDK.[3]

See also

References
