TensorRT-LLM
NVIDIA's TensorRT-based library for optimized LLM inference on GPUs, with multi-GPU execution and speculative decoding support.
Why it is included
An open-source (Apache-2.0) serving path when you standardize on NVIDIA datacenter GPUs.
Best for
Production LLM serving on NVIDIA hardware with maximum kernel optimization.
Strengths
- Hand-tuned NVIDIA GPU kernels
- Multi-GPU execution (tensor and pipeline parallelism)
- Broad library of model recipes
Limitations
- Tied to NVIDIA hardware; engine builds add deployment complexity
Good alternatives
vLLM · SGLang
Related tools
AI & Machine Learning
vLLM
High-throughput LLM serving with PagedAttention, continuous batching, and OpenAI-compatible APIs for GPU clusters.
PyTorch
Deep learning framework with strong research-to-production paths.
SGLang
Structured generation language for fast serving: RadixAttention, constrained decoding, and multi-turn batching for frontier-class workloads.
NVIDIA Triton Inference Server
Multi-framework inference server for TensorRT, ONNX, PyTorch, Python backends—dynamic batching, ensembles, and GPU sharing.
rtp-llm
Alibaba’s high-performance LLM inference engine (CUDA-focused) for production serving of diverse decoder architectures.
Ollama
Local LLM runner and model library with simple CLI and API for workstation inference.
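Several of the serving stacks above (vLLM, SGLang, Ollama, and TensorRT-LLM behind an OpenAI-compatible frontend) accept the same chat-completions request format, which makes it easy to swap backends. A minimal sketch of building that request body, assuming a hypothetical local endpoint and model name (neither is a real deployment):

```python
import json

# Assumed placeholder values -- substitute your own deployment's endpoint and model.
BASE_URL = "http://localhost:8000/v1"

def chat_payload(model, user_message, max_tokens=128, temperature=0.7):
    """Build an OpenAI-compatible /v1/chat/completions request body."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
        "max_tokens": max_tokens,
        "temperature": temperature,
    }

# What an HTTP client would POST to BASE_URL + "/chat/completions".
payload = chat_payload("my-llm", "Summarize PagedAttention in one sentence.")
print(json.dumps(payload))
```

Because the request shape is shared, benchmarking one engine against another often reduces to pointing the same client at a different port.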
