ExLlamaV2
Memory-efficient CUDA inference kernels for quantized Llama-class models, popular in consumer-GPU chat UIs.
Why it is included
Reference implementation for fast GPTQ/EXL2-style serving on single GPUs.
Best for
Power users maximizing tokens per second on a single NVIDIA GPU.
Strengths
- Fast generation on quantized (GPTQ/EXL2) weights
- Low VRAM footprint
- Broad community integrations (local chat UIs, serving frontends)
Limitations
- NVIDIA-focused (CUDA); the ecosystem moves quickly, so formats and integrations change often
Good alternatives
llama.cpp · vLLM
Related tools
AI & Machine Learning
llama.cpp
Plain C/C++ inference for LLaMA-class models with broad community backends.
vLLM
High-throughput LLM serving with PagedAttention, continuous batching, and OpenAI-compatible APIs for GPU clusters.
Ollama
Local LLM runner and model library with simple CLI and API for workstation inference.
MLX LM
Apple MLX-based LLM inference and training on Apple silicon: efficient Metal-backed transformers and examples for local chat models.
llamafile
Single-file distributable LLM weights + llama.cpp runtime: run large models from one executable with broad OS CPU/GPU support.
SGLang
Structured generation language for fast serving: RadixAttention, constrained decoding, and multi-turn batching for frontier-class workloads.
