OpenCatalog, curated by FLOSSK
AI & Machine Learning

ExLlamaV2

Memory-efficient CUDA inference kernels for quantized Llama-class models, widely used in consumer-GPU chat UIs.

Why it is included

Reference implementation for fast GPTQ/EXL2-style serving on a single GPU.

Best for

Power users maximizing tokens/sec on a single NVIDIA card.

Strengths

  • Quantized speed
  • Tight VRAM use
  • Community integrations

Limitations

  • NVIDIA-focused (CUDA); the quantization ecosystem moves fast, so formats and APIs change often

Good alternatives

llama.cpp · vLLM
