Inference Optimizations

Techniques that make it possible to run large language models on commodity hardware, from quantization to memory management.

Quantization

AWQ

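Activation-aware Weight Quantization (AWQ) identifies the small fraction of weight channels that matter most, judged by activation magnitude rather than weight magnitude, and protects them with per-channel scaling before quantization. At 4-bit precision it typically preserves quality better than plain round-to-nearest quantization.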

Tags: quantization, 4-bit, activation-aware, GPU
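
The scaling idea can be illustrated with a small pure-Python sketch. It is a simplification, not the AWQ implementation: the toy layer, the helper names (`quantize_row`, `scaled_quantize`, `awq_like`) and the 11-point grid are assumptions made for illustration. Each input channel is scaled by its mean absolute activation raised to a power `alpha`, the weights are rounded to a 4-bit grid, and the scale is folded back out. Because `alpha = 0` reproduces plain rounding, the searched result can never be worse than the naive baseline on the calibration data.

```python
import random

def quantize_row(row, bits=4):
    """Symmetric round-to-nearest: snap one weight row to a signed integer grid."""
    qmax = 2 ** (bits - 1) - 1
    step = max(abs(w) for w in row) / qmax or 1.0
    return [max(-qmax - 1, min(qmax, round(w / step))) * step for w in row]

def output_error(weights, approx, acts):
    """Mean squared difference between exact and quantized layer outputs."""
    total = 0.0
    for x in acts:
        for row, approx_row in zip(weights, approx):
            exact = sum(w * xi for w, xi in zip(row, x))
            quant = sum(w * xi for w, xi in zip(approx_row, x))
            total += (exact - quant) ** 2
    return total / len(acts)

def scaled_quantize(weights, scales):
    """Scale input channels up, quantize, then fold the scale back out."""
    result = []
    for row in weights:
        scaled = [w * s for w, s in zip(row, scales)]
        result.append([q / s for q, s in zip(quantize_row(scaled), scales)])
    return result

def awq_like(weights, acts):
    """Grid-search alpha in [0, 1] for scales s_j = mean|x_j| ** alpha."""
    n_in = len(weights[0])
    mean_abs = [max(sum(abs(x[j]) for x in acts) / len(acts), 1e-8)
                for j in range(n_in)]
    best = None
    for i in range(11):
        alpha = i / 10
        scales = [m ** alpha for m in mean_abs]
        approx = scaled_quantize(weights, scales)
        err = output_error(weights, approx, acts)
        if best is None or err < best[0]:
            best = (err, alpha, approx)
    return best

random.seed(0)
# Hypothetical toy layer: 4 outputs, 8 inputs; input channel 0 sees large activations.
weights = [[random.gauss(0, 1) for _ in range(8)] for _ in range(4)]
acts = [[random.gauss(0, 10 if j == 0 else 1) for j in range(8)] for _ in range(64)]

naive_err = output_error(weights, [quantize_row(r) for r in weights], acts)
awq_err, best_alpha, _ = awq_like(weights, acts)
print(f"naive: {naive_err:.4f}  scaled: {awq_err:.4f}  alpha: {best_alpha}")
```

In a real deployment the inverse scale would be folded into the preceding operation rather than divided out of the dequantized weights; the two are equivalent for the layer output, which is all this sketch measures.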

Attention

FlashAttention

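IO-aware exact attention algorithm that reduces reads and writes between GPU high-bandwidth memory and on-chip SRAM. It tiles the computation so the full attention matrix is never materialized, and recomputes intermediate values in the backward pass instead of storing them. Reported speedups are in the 2-4x range over standard attention implementations, and memory use grows linearly rather than quadratically with sequence length, which enables longer contexts without approximation.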

Tags: attention, memory-efficient, exact, GPU
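
The tiling half of the idea rests on an online softmax: keys and values are processed one block at a time while a running maximum, a running normalizer, and a running weighted sum are kept and rescaled. The sketch below is a forward-pass illustration in plain Python, not the GPU kernel. The function names and the block size are assumptions, it tiles over keys only, and it does not cover the backward-pass recomputation.

```python
import math
import random

def naive_attention(Q, K, V):
    """Reference: builds the full row of scores for every query."""
    scale = 1.0 / math.sqrt(len(Q[0]))
    out = []
    for q in Q:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in K]
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append([sum(w * v[d] for w, v in zip(weights, V)) / total
                    for d in range(len(V[0]))])
    return out

def tiled_attention(Q, K, V, block=4):
    """Same result, but only one block of scores exists at any time."""
    scale = 1.0 / math.sqrt(len(Q[0]))
    out = []
    for q in Q:
        running_max = float("-inf")
        running_sum = 0.0
        acc = [0.0] * len(V[0])
        for start in range(0, len(K), block):
            k_blk = K[start:start + block]
            v_blk = V[start:start + block]
            scores = [scale * sum(a * b for a, b in zip(q, k)) for k in k_blk]
            new_max = max(running_max, max(scores))
            # Rescale what was accumulated under the old maximum.
            correction = math.exp(running_max - new_max)
            weights = [math.exp(s - new_max) for s in scores]
            running_sum = running_sum * correction + sum(weights)
            acc = [a * correction + sum(w * v[d] for w, v in zip(weights, v_blk))
                   for d, a in enumerate(acc)]
            running_max = new_max
        out.append([a / running_sum for a in acc])
    return out

random.seed(0)
Q = [[random.gauss(0, 1) for _ in range(8)] for _ in range(3)]
K = [[random.gauss(0, 1) for _ in range(8)] for _ in range(10)]
V = [[random.gauss(0, 1) for _ in range(8)] for _ in range(10)]

exact = naive_attention(Q, K, V)
tiled = tiled_attention(Q, K, V)
max_diff = max(abs(a - b) for ra, rb in zip(exact, tiled) for a, b in zip(ra, rb))
print("largest difference:", max_diff)
```

The two functions agree up to floating-point rounding, which is the sense in which the tiled computation is exact rather than approximate.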

Quantization

GGUF Quantization

File format used by llama.cpp, together with its family of block-wise quantization types. Supports mixed-precision quantization from Q2 to Q8 (roughly 2 to 8 bits per weight), enabling models to run on CPUs and Apple Silicon with configurable quality-speed tradeoffs.

Tags: quantization, CPU, mixed-precision, llama.cpp
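
Block-wise quantization of this kind can be sketched in a few lines. The example below is modeled loosely on the simplest 8-bit layout (blocks of 32 weights sharing one scale) and is only an illustration: it keeps the scale as a Python float where the file format stores a 16-bit value, the function names are assumptions, and the lower-bit and "K" types in llama.cpp use more elaborate layouts that are not reproduced here.

```python
import random

BLOCK = 32  # weights per block, as in the simplest 8-bit layout

def quantize_blocks(values, bits=8):
    """Split values into blocks; keep one scale and a list of signed ints per block."""
    qmax = 2 ** (bits - 1) - 1
    blocks = []
    for start in range(0, len(values), BLOCK):
        chunk = values[start:start + BLOCK]
        amax = max(abs(v) for v in chunk)
        scale = amax / qmax if amax else 1.0
        blocks.append((scale, [round(v / scale) for v in chunk]))
    return blocks

def dequantize_blocks(blocks):
    """Rebuild approximate weights from (scale, ints) pairs."""
    return [scale * q for scale, qs in blocks for q in qs]

def bits_per_weight(bits, scale_bits=16):
    """Storage cost: the payload bits plus the per-block scale spread over the block."""
    return bits + scale_bits / BLOCK

random.seed(0)
weights = [random.gauss(0, 1) for _ in range(64)]
blocks = quantize_blocks(weights)
restored = dequantize_blocks(blocks)
worst = max(abs(a - b) for a, b in zip(weights, restored))
print("worst-case error:", worst)
print("bits per weight at 8-bit:", bits_per_weight(8))
print("bits per weight at 4-bit:", bits_per_weight(4))
```

The per-block scale is what makes the quality-speed tradeoff configurable: fewer payload bits shrink the file, while the small fixed overhead for scales keeps each block's rounding error tied to its own range of values.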

Quantization

GPTQ

Post-training quantization method that compresses model weights to 4-bit or 3-bit precision using approximate second-order information. Enables running large models on consumer GPUs.

Tags: quantization, 4-bit, post-training, GPU
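
The role of second-order information can be shown on a single weight row. In the sketch below, which is a simplified illustration and not the GPTQ implementation, weights are quantized one at a time, and the rounding error of each is spread over the weights not yet quantized using the inverse of a Hessian proxy built from calibration activations. The helper names, the damping value, and the toy sizes are assumptions; GPTQ itself works on blocks of columns and uses a Cholesky factorization of the inverse Hessian for speed and numerical stability.

```python
import random

def invert(matrix):
    """Gauss-Jordan inverse of a small, well-conditioned square matrix."""
    n = len(matrix)
    aug = [list(row) + [float(i == j) for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [a - f * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def round_to_grid(w, step, qmax=7):
    """Nearest point on a signed 4-bit grid with the given step."""
    return max(-qmax - 1, min(qmax, round(w / step))) * step

def gptq_like_row(weights, acts, step, damp=0.01):
    """Quantize left to right, pushing each rounding error onto later weights."""
    n = len(weights)
    # Hessian proxy of the layer-wise squared error: X^T X plus damping.
    H = [[sum(x[i] * x[j] for x in acts) for j in range(n)] for i in range(n)]
    mean_diag = sum(H[i][i] for i in range(n)) / n
    for i in range(n):
        H[i][i] += damp * mean_diag
    Hinv = invert(H)
    w = list(weights)
    quantized = []
    for i in range(n):
        q = round_to_grid(w[i], step)
        quantized.append(q)
        err = (w[i] - q) / Hinv[i][i]
        for j in range(i + 1, n):
            w[j] -= err * Hinv[i][j]
        # Remove coordinate i from the inverse Hessian.
        d, col, row = Hinv[i][i], [Hinv[r][i] for r in range(n)], list(Hinv[i])
        for r in range(n):
            for c in range(n):
                Hinv[r][c] -= col[r] * row[c] / d
    return quantized

def layer_error(weights, approx, acts):
    """Squared output error of one row over the calibration activations."""
    return sum(sum((w - q) * xi for w, q, xi in zip(weights, approx, x)) ** 2
               for x in acts)

random.seed(0)
acts = [[random.gauss(0, 1) for _ in range(6)] for _ in range(32)]
weights = [random.gauss(0, 1) for _ in range(6)]
step = max(abs(w) for w in weights) / 7

naive = [round_to_grid(w, step) for w in weights]
compensated = gptq_like_row(weights, acts, step)
print("naive error:      ", layer_error(weights, naive, acts))
print("compensated error:", layer_error(weights, compensated, acts))
```

Both results lie on the same 4-bit grid; the only difference is that the compensated version picks grid points with the effect on the layer output in mind rather than rounding each weight in isolation.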

Memory

PagedAttention

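Virtual-memory-inspired KV cache management that stores each sequence's cache in fixed-size blocks that need not be contiguous, mapped through a per-sequence block table. This largely eliminates fragmentation: waste is limited to the unfilled part of each sequence's last block, which lets the server fit larger batches and raise throughput.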

Tags: KV-cache, memory-management, throughput, vLLM
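
The bookkeeping behind this can be sketched as a small allocator. The class below is a hypothetical illustration, not vLLM's API: `PagedKVCache` and its methods are invented names, and it tracks only which physical block holds each token's slot, not the key and value tensors themselves.

```python
class PagedKVCache:
    """Toy block-table allocator for KV cache slots."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))  # physical block ids
        self.block_tables = {}  # seq_id -> list of physical block ids
        self.lengths = {}       # seq_id -> number of tokens stored

    def append_token(self, seq_id):
        """Reserve a slot for one new token; returns (physical block, offset)."""
        length = self.lengths.get(seq_id, 0)
        if length % self.block_size == 0:  # first token, or last block is full
            if not self.free_blocks:
                raise MemoryError("no free KV blocks")
            self.block_tables.setdefault(seq_id, []).append(self.free_blocks.pop())
        self.lengths[seq_id] = length + 1
        return self.block_tables[seq_id][-1], length % self.block_size

    def free(self, seq_id):
        """Return all blocks of a finished sequence to the free list."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)

    def wasted_slots(self):
        """Allocated but unused slots; at most block_size - 1 per sequence."""
        return sum(len(table) * self.block_size - self.lengths[seq_id]
                   for seq_id, table in self.block_tables.items())

cache = PagedKVCache(num_blocks=8, block_size=4)
for _ in range(5):
    cache.append_token("a")  # needs two blocks, second one mostly empty
for _ in range(4):
    cache.append_token("b")  # fills exactly one block
print("block tables:", cache.block_tables)
print("wasted slots:", cache.wasted_slots())
```

Because blocks are allocated on demand and need not be adjacent, no memory has to be reserved up front for a sequence's maximum possible length, and a finished sequence's blocks are immediately reusable by any other sequence.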

Decoding

Speculative Decoding

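Uses a smaller draft model to generate several candidate tokens, which the larger model then verifies in a single parallel pass. Tokens the larger model agrees with are accepted; the first disagreement is replaced by the larger model's own choice. Commonly reported speedups are around 2-3x, and with the standard acceptance rule the output distribution matches that of the larger model, so there is no quality loss.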

Tags: decoding, draft-model, lossless, latency
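
The propose-and-verify loop can be sketched with two toy "models". Everything here is an assumption made for illustration: `target_next` and `draft_next` are hypothetical deterministic functions standing in for real networks, decoding is greedy, and the parallel verification pass is simulated with a list comprehension. With sampling instead of greedy decoding, the usual approach is a rejection-sampling acceptance rule that preserves the larger model's distribution; that rule is not shown.

```python
def target_next(context):
    """Stand-in for the large model: a deterministic next-token rule."""
    return (3 * sum(context) + len(context)) % 11

def draft_next(context):
    """Stand-in for the small model: agrees with the target most of the time."""
    guess = target_next(context)
    return (guess + 1) % 11 if len(context) % 4 == 3 else guess

def greedy_decode(next_token, prompt, n_new):
    """Plain autoregressive decoding, one model call per token."""
    tokens = list(prompt)
    for _ in range(n_new):
        tokens.append(next_token(tokens))
    return tokens

def speculative_decode(prompt, n_new, k=4):
    """Draft k tokens, verify them in one target pass, keep the agreed prefix."""
    tokens = list(prompt)
    goal = len(prompt) + n_new
    target_passes = 0
    while len(tokens) < goal:
        draft = []
        for _ in range(k):
            draft.append(draft_next(tokens + draft))
        # One (simulated) parallel pass: the target's choice at every position.
        target_passes += 1
        verified = [target_next(tokens + draft[:i]) for i in range(k + 1)]
        accepted = 0
        while accepted < k and draft[accepted] == verified[accepted]:
            accepted += 1
        # Accepted draft tokens, then the target's token at the first mismatch
        # (or its bonus token when every draft token was accepted).
        tokens.extend(draft[:accepted] + [verified[accepted]])
    return tokens[:goal], target_passes

prompt = [1, 2]
baseline = greedy_decode(target_next, prompt, 20)
speculative, passes = speculative_decode(prompt, 20)
print("same output:", speculative == baseline)
print("target passes:", passes, "instead of 20")
```

The output is identical to decoding with the large model alone; the saving comes from the large model running once per batch of accepted tokens rather than once per token, so the benefit depends on how often the draft model guesses correctly.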