Inference Engines

Open-source frameworks for serving and running LLMs — from cloud-scale GPU clusters to local laptops.

llama.cpp

Georgi Gerganov

C/C++ LLM inference with minimal dependencies. Loads models in the GGUF file format, typically with quantized weights, and runs on CPU and Apple Silicon with optional GPU offloading. Widely used as the backbone of local LLM inference.

local, CPU, Apple-Silicon, GGUF, quantization
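To make the idea of quantized weights concrete, here is a minimal sketch of symmetric block-wise quantization. It is a generic illustration, not the GGUF format or any llama.cpp quantization scheme; the function names, the 4-bit width, and the sample values are all assumptions made for the example.

```python
def quantize_block(values, bits=4):
    """Map a block of floats to small signed integers plus one shared scale."""
    qmax = 2 ** (bits - 1) - 1                      # 7 for 4-bit signed values
    scale = max(abs(v) for v in values) / qmax or 1.0  # 1.0 for an all-zero block
    quantized = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return scale, quantized


def dequantize_block(scale, quantized):
    """Recover approximate floats from the integers and the shared scale."""
    return [scale * q for q in quantized]


# Hypothetical weights for one block.
weights = [0.70, -0.31, 0.02, 0.44]
scale, q = quantize_block(weights)
restored = dequantize_block(scale, q)
```

Each weight is stored as a small integer, and only one float (the scale) is kept per block, which is where the memory saving comes from. The price is a rounding error of at most half a scale step per weight.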

Ollama

110k+

Run LLMs locally with a simple CLI. Wraps llama.cpp with model management, an HTTP API, and one-command model downloads. One of the easiest ways to get started with local models.

local, CLI, easy-setup, model-management

SGLang

LMSYS

20k+

Fast serving framework that uses RadixAttention to reuse the KV cache automatically across requests that share a prefix. Also provides a frontend language for writing multi-step LLM programs whose calls can run in parallel.

serving, RadixAttention, KV-cache-reuse, GPU
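The prefix-reuse idea behind KV cache sharing can be sketched with a toy token trie. This is not SGLang's implementation: a real radix tree compresses edges, stores actual KV tensors, and evicts old entries. The class, the `serve` helper, and the token ids below are hypothetical.

```python
class PrefixCache:
    """Toy token trie; each node stands in for one cached KV entry."""

    def __init__(self):
        self.root = {}

    def match_len(self, tokens):
        """Return how many leading tokens are already cached."""
        node, matched = self.root, 0
        for tok in tokens:
            if tok not in node:
                break
            node = node[tok]
            matched += 1
        return matched

    def insert(self, tokens):
        """Record the full token sequence as cached."""
        node = self.root
        for tok in tokens:
            node = node.setdefault(tok, {})


def serve(cache, tokens):
    """Return (reused, computed) token counts for one request."""
    reused = cache.match_len(tokens)
    cache.insert(tokens)
    return reused, len(tokens) - reused


cache = PrefixCache()
system_prompt = [1, 2, 3, 4]                      # hypothetical shared prefix
first = serve(cache, system_prompt + [10, 11])    # nothing cached yet
second = serve(cache, system_prompt + [20])       # shares the system prompt
```

Here the second request only needs to compute its one new token, because the four system-prompt tokens were cached by the first request.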

TensorRT-LLM

NVIDIA

10k+

NVIDIA's optimized inference library for LLMs on NVIDIA GPUs. Builds on TensorRT for kernel fusion and adds in-flight batching and FP8 quantization to maximize throughput.

NVIDIA, GPU-optimized, FP8, production

Text Generation Inference

Hugging Face

10k+

Production-ready inference server for LLMs. Features continuous batching, Flash Attention, quantization support (GPTQ/AWQ), and seamless Hugging Face Hub integration.

serving, production, Hugging-Face, GPU
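Continuous batching can be illustrated with a small scheduling simulation. This is a sketch of the general technique, not code from Text Generation Inference; the function name, the request tuples, and the one-token-per-step model are assumptions.

```python
from collections import deque


def continuous_batching(requests, max_batch):
    """Simulate decoding; requests are (name, tokens_to_generate >= 1) pairs.

    Returns a dict mapping each request name to the step it finished on.
    """
    waiting = deque(requests)
    running = {}    # name -> tokens still to generate
    finished = {}
    step = 0
    while waiting or running:
        # Fill any free batch slots before every decode step,
        # instead of waiting for the whole batch to drain.
        while waiting and len(running) < max_batch:
            name, remaining = waiting.popleft()
            running[name] = remaining
        step += 1
        for name in list(running):
            running[name] -= 1          # one token per running request
            if running[name] == 0:
                del running[name]
                finished[name] = step
    return finished


done = continuous_batching([("a", 2), ("b", 5), ("c", 2)], max_batch=2)
```

In this run, request `c` takes over the slot freed by `a` at step 2 and finishes at step 4. A static batch would hold `c` back until `b` completes at step 5.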

vLLM

vLLM Team

45k+

High-throughput LLM serving engine featuring PagedAttention for efficient memory management. Supports continuous batching, tensor parallelism, and an OpenAI-compatible API.

serving, PagedAttention, continuous-batching, GPU
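The memory-management idea behind paged KV caching can be sketched as a block allocator: each sequence gets fixed-size blocks on demand rather than one large contiguous reservation. This is a toy model, not vLLM's code; `BlockAllocator`, `BLOCK_SIZE`, and the sequence ids are hypothetical.

```python
BLOCK_SIZE = 4  # tokens per block (assumed value for illustration)


class BlockAllocator:
    """Hands out fixed-size KV blocks and tracks a block table per sequence."""

    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))
        self.tables = {}    # sequence id -> list of physical block ids
        self.lengths = {}   # sequence id -> number of tokens stored

    def append_token(self, seq_id):
        """Reserve space for one more token, allocating a block if needed."""
        length = self.lengths.get(seq_id, 0)
        table = self.tables.setdefault(seq_id, [])
        if length % BLOCK_SIZE == 0:    # no block yet, or last block is full
            if not self.free:
                raise MemoryError("no free KV blocks")
            table.append(self.free.pop(0))
        self.lengths[seq_id] = length + 1

    def release(self, seq_id):
        """Return all blocks of a finished sequence to the free pool."""
        self.free.extend(self.tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)


alloc = BlockAllocator(num_blocks=4)
for _ in range(5):
    alloc.append_token("a")   # 5 tokens need two blocks
for _ in range(3):
    alloc.append_token("b")   # 3 tokens fit in one block
```

Because blocks are allocated only when a sequence actually grows into them, at most one partially filled block per sequence is wasted, and freed blocks can be reused by any other sequence.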