Inference Engines
Open-source frameworks for serving and running LLMs — from cloud-scale GPU clusters to local laptops.
llama.cpp
Georgi Gerganov
C/C++ inference of LLMs with minimal dependencies. Loads quantized models in the GGUF format and runs on CPU and Apple Silicon with optional GPU offloading. Other local tools, including Ollama, build on it.
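To illustrate what quantization means here, the following is a minimal sketch of block-wise 8-bit quantization: each block of weights shares one floating-point scale and stores small integers. It is not the GGUF on-disk encoding or llama.cpp's code; the function names and the block size of 32 are assumptions chosen for illustration.

```python
def quantize_block(values):
    """Quantize one block of floats to integers in [-127, 127] with a shared scale."""
    amax = max(abs(v) for v in values)
    scale = amax / 127.0 if amax > 0 else 1.0  # avoid dividing by zero for all-zero blocks
    return scale, [round(v / scale) for v in values]


def quantize(values, block_size=32):
    """Split a flat list of weights into blocks and quantize each block independently."""
    return [
        quantize_block(values[i:i + block_size])
        for i in range(0, len(values), block_size)
    ]


def dequantize(blocks):
    """Reconstruct approximate float weights from (scale, ints) blocks."""
    out = []
    for scale, ints in blocks:
        out.extend(scale * q for q in ints)
    return out


weights = [i / 10 - 3.2 for i in range(64)]  # hypothetical weights
blocks = quantize(weights)
restored = dequantize(blocks)
```

Each restored value differs from the original by at most half a scale step, which is the trade made for storing one byte per weight plus one scale per block.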
Ollama
Ollama
Run LLMs locally with a simple CLI. Wraps llama.cpp with model management, an HTTP API, and one-command model downloads. A low-friction way to get started with local models.
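As a rough sketch of the HTTP API, the snippet below builds a request for Ollama's generate endpoint using only the Python standard library. It assumes a server listening on the default local port 11434; the model name "llama3" is a placeholder, and the helper names are hypothetical.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # assumes the default local port


def build_generate_request(model, prompt):
    """Build (but do not send) a non-streaming generate request."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model, prompt):
    """Send the request and return the generated text (requires a running server)."""
    with urllib.request.urlopen(build_generate_request(model, prompt)) as resp:
        return json.loads(resp.read())["response"]


# "llama3" is a placeholder; use whichever model has been pulled locally.
request = build_generate_request("llama3", "Why is the sky blue?")
```

Calling `generate("llama3", "...")` only works once the server is running and the model has been downloaded.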
SGLang
LMSYS
Fast serving framework with RadixAttention for automatic KV cache reuse across requests. Includes a frontend language for writing multi-step LLM programs whose calls can run in parallel.
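The idea behind KV cache reuse across requests can be sketched with a toy prefix index: if a new prompt shares a token prefix with an earlier one, the cached state for that prefix need not be recomputed. This is a simplified per-token trie, not SGLang's implementation (which uses a radix tree with eviction); the class and method names are hypothetical.

```python
class PrefixCache:
    """Toy index of previously seen token sequences, stored as a nested-dict trie."""

    def __init__(self):
        self.root = {}

    def insert(self, tokens):
        """Record a processed token sequence so later requests can reuse its prefix."""
        node = self.root
        for token in tokens:
            node = node.setdefault(token, {})

    def match_prefix(self, tokens):
        """Return how many leading tokens are already cached."""
        node, matched = self.root, 0
        for token in tokens:
            if token not in node:
                break
            node = node[token]
            matched += 1
        return matched


cache = PrefixCache()
cache.insert([1, 2, 3, 4])                    # e.g. a shared system prompt plus a question
reused = cache.match_prefix([1, 2, 3, 9, 9])  # new request sharing the first three tokens
```

In a real server, the matched length tells the engine how many tokens of prefill can be skipped for the new request.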
TensorRT-LLM
NVIDIA
Optimized inference library for LLMs on NVIDIA GPUs. Builds on TensorRT and uses kernel fusion, in-flight batching, and FP8 quantization to increase throughput.
Text Generation Inference
Hugging Face
Production-ready inference server for LLMs. Features continuous batching, Flash Attention, quantization support (GPTQ/AWQ), and seamless Hugging Face Hub integration.
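Continuous batching can be illustrated with a toy scheduler: a finished request frees its slot immediately, and a waiting request joins the batch at the next decode step instead of waiting for the whole batch to drain. This is a simulation under simplifying assumptions (one token per request per step, no prefill cost), not Text Generation Inference code; the function name is hypothetical.

```python
from collections import deque


def continuous_batching(requests, max_batch):
    """Simulate decode steps; return the step at which each request finishes.

    requests: list of (request_id, tokens_to_generate) in arrival order.
    """
    waiting = deque(requests)
    active = {}       # request_id -> tokens still to generate
    finished_at = {}
    step = 0
    while waiting or active:
        # Fill any free slots before the next decode step.
        while waiting and len(active) < max_batch:
            request_id, tokens = waiting.popleft()
            active[request_id] = tokens
        step += 1
        # One decode step produces one token for every active request.
        for request_id in list(active):
            active[request_id] -= 1
            if active[request_id] <= 0:
                del active[request_id]
                finished_at[request_id] = step
    return finished_at


schedule = continuous_batching([("a", 2), ("b", 5), ("c", 2)], max_batch=2)
```

Here request "c" takes over the slot freed by "a" and finishes at step 4, while the long request "b" is still running; with static batching it would have had to wait for "b" to complete first.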
vLLM
vLLM Team
High-throughput LLM serving engine featuring PagedAttention for efficient KV cache memory management. Supports continuous batching, tensor parallelism, and an OpenAI-compatible API.
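The memory-management idea behind PagedAttention can be sketched as a block allocator: the KV cache is divided into fixed-size blocks, and each sequence keeps a table of the blocks it owns, allocating a new one only when the current block fills up. This is a toy bookkeeping model, not vLLM's implementation; the class name, method names, and block size are assumptions.

```python
class BlockAllocator:
    """Toy paged KV cache: hands out fixed-size blocks to sequences on demand."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free = list(range(num_blocks))  # ids of unused physical blocks
        self.tables = {}                     # seq_id -> list of physical block ids
        self.lengths = {}                    # seq_id -> number of tokens stored

    def append_token(self, seq_id):
        """Reserve space for one more token, allocating a block only when needed."""
        table = self.tables.setdefault(seq_id, [])
        length = self.lengths.get(seq_id, 0)
        if length == len(table) * self.block_size:  # all owned blocks are full
            if not self.free:
                raise MemoryError("no free KV cache blocks")
            table.append(self.free.pop())
        self.lengths[seq_id] = length + 1

    def release(self, seq_id):
        """Return all blocks of a finished sequence to the free pool."""
        self.free.extend(self.tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)


allocator = BlockAllocator(num_blocks=4, block_size=16)
for _ in range(17):
    allocator.append_token("seq-1")  # the 17th token spills into a second block
```

Because blocks are allocated on demand rather than reserved up front for the maximum sequence length, wasted memory per sequence is bounded by one partially filled block.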