ExLlamaV2
Memory-efficient CUDA inference kernels for quantized Llama-class models, popular in consumer-GPU chat UIs.
Why it is included
Reference implementation for fast GPTQ/EXL2-style serving on single GPUs.
Best for
Power users maximizing tokens per second on a single NVIDIA GPU.
Strengths
- Fast generation on quantized (GPTQ/EXL2) weights
- Low VRAM footprint
- Broad community integrations (local chat UIs, serving frontends)
Limitations
- NVIDIA-focused (CUDA); the ecosystem moves quickly, so formats and integrations change often
Good alternatives
llama.cpp · vLLM
Related tools
AI & Machine Learning
llama.cpp
Plain C/C++ inference for LLaMA-class models with broad community backends.
vLLM
High-throughput LLM serving with PagedAttention, continuous batching, and OpenAI-compatible APIs for GPU clusters.
Ollama
Local LLM runner and model library with simple CLI and API for workstation inference.
MLX LM
Apple MLX-based LLM inference and training on Apple silicon: efficient Metal-backed transformers and examples for local chat models.
llamafile
Single-file distributable LLM weights + llama.cpp runtime: run large models from one executable with broad OS CPU/GPU support.
SGLang
Structured generation language for fast serving: RadixAttention, constrained decoding, and multi-turn batching for frontier-class workloads.
