Inference Optimizations
The techniques that make it possible to run powerful LLMs on commodity hardware — from quantization to memory management.
AWQ
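Activation-aware weight quantization. Uses activation magnitudes from calibration data to identify the small fraction of salient weight channels, then protects them by per-channel scaling before quantization rather than by keeping them in higher precision. Achieves better quality than plain round-to-nearest quantization at 4-bit precision.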
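The scaling idea can be illustrated with a minimal sketch for a single weight row. It is a simplification under stated assumptions, not the reference AWQ implementation: the function names, the per-row symmetric quantizer, and the small grid of exponents are all hypothetical choices made for illustration.

```python
def quantize_row(w, bits=4):
    """Symmetric round-to-nearest quantization of one row; returns dequantized values."""
    qmax = 2 ** (bits - 1) - 1
    scale = (max(abs(v) for v in w) / qmax) or 1.0
    return [max(-qmax, min(qmax, round(v / scale))) * scale for v in w]


def output_error(w, w_hat, s, calib):
    """Mean squared error between x.w and (x / s).w_hat over calibration inputs."""
    total = 0.0
    for x in calib:
        ref = sum(xi * wi for xi, wi in zip(x, w))
        got = sum((xi / si) * qi for xi, si, qi in zip(x, s, w_hat))
        total += (ref - got) ** 2
    return total / len(calib)


def awq_like_quantize(w, calib, bits=4, alphas=(0.0, 0.25, 0.5, 0.75, 1.0)):
    """Pick per-channel scales s_j = mean|x_j| ** alpha that minimize output error.

    Weights are multiplied by s before quantization and activations are divided
    by s at inference, so channels with large activations get finer effective
    resolution. alpha = 0 reduces to plain round-to-nearest.
    """
    n = len(w)
    act_mag = [sum(abs(x[j]) for x in calib) / len(calib) for j in range(n)]
    best = None
    for alpha in alphas:
        s = [max(m, 1e-8) ** alpha for m in act_mag]
        w_hat = quantize_row([wi * si for wi, si in zip(w, s)], bits)
        err = output_error(w, w_hat, s, calib)
        if best is None or err < best[0]:
            best = (err, alpha, s, w_hat)
    return best


# Toy data (made up): channel 0 sees much larger activations than the others.
weights = [0.14, 0.7, -0.33, 0.52]
calibration = [[10.0, 1.0, 0.5, -1.0], [-8.0, 0.5, 1.0, 2.0]]
err, alpha, scales, w_hat = awq_like_quantize(weights, calibration)
print("chosen alpha:", alpha, "output MSE:", err)
```

Because alpha = 0 is in the search grid, the selected configuration is never worse than round-to-nearest on the calibration inputs.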
FlashAttention
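IO-aware exact attention algorithm that uses tiling to reduce reads and writes between GPU high-bandwidth memory and on-chip SRAM, and recomputes attention in the backward pass instead of storing the full attention matrix. Memory use grows linearly rather than quadratically with sequence length. Reported speedups are roughly 2-4x over standard attention implementations, and longer context lengths become practical without approximation.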
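The tiling step relies on a running ("online") softmax, which can be sketched in plain Python for a single query vector. This illustrates only the arithmetic that lets keys and values be processed one block at a time; it is not a GPU kernel and does not cover the backward-pass recomputation. The function names and block size are assumptions for illustration.

```python
import math


def naive_attention(q, K, V):
    """Reference: softmax(q.K^T / sqrt(d)) V, materializing all scores at once."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in K]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [sum(w * v[j] for w, v in zip(weights, V)) / total
            for j in range(len(V[0]))]


def tiled_attention(q, K, V, block=2):
    """Same result, visiting K and V one block at a time.

    Only a running max, a running denominator, and an output accumulator are
    kept, so the full score vector is never stored.
    """
    scale = 1.0 / math.sqrt(len(q))
    running_max = float("-inf")
    running_sum = 0.0
    acc = [0.0] * len(V[0])
    for start in range(0, len(K), block):
        k_blk = K[start:start + block]
        v_blk = V[start:start + block]
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in k_blk]
        new_max = max(running_max, max(scores))
        # Rescale earlier partial results to the new max (0.0 on the first block).
        rescale = math.exp(running_max - new_max)
        probs = [math.exp(s - new_max) for s in scores]
        running_sum = running_sum * rescale + sum(probs)
        acc = [a * rescale + sum(p * v[j] for p, v in zip(probs, v_blk))
               for j, a in enumerate(acc)]
        running_max = new_max
    return [a / running_sum for a in acc]
```

Both functions compute the same values up to floating-point rounding, which is what "exact" means here: the tiling changes the memory access pattern, not the result.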
GGUF Quantization
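File format used by llama.cpp to store model weights and metadata, paired with a family of block-wise quantization types ranging from roughly 2 to 8 bits per weight (Q2 through Q8). Different tensors in one file can use different types, enabling models to run on CPUs and Apple Silicon with configurable quality-speed tradeoffs.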
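Block-wise quantization of this kind can be sketched as follows, modeled loosely on the 8-bit type that stores one scale per block of 32 weights. This is a simplified illustration, not the llama.cpp code: the function names are hypothetical, and the scale is kept as a Python float, whereas the real format stores it in 16 bits.

```python
BLOCK = 32  # weights per block, as assumed for this sketch


def quantize_blocks(values):
    """Quantize to signed 8-bit integers with one scale per block."""
    assert len(values) % BLOCK == 0, "length must be a multiple of the block size"
    blocks = []
    for i in range(0, len(values), BLOCK):
        chunk = values[i:i + BLOCK]
        amax = max(abs(v) for v in chunk)
        d = amax / 127.0                   # per-block scale
        inv = 1.0 / d if d else 0.0        # all-zero block: avoid division by zero
        blocks.append((d, [round(v * inv) for v in chunk]))
    return blocks


def dequantize_blocks(blocks):
    """Reconstruct approximate weights from (scale, integers) pairs."""
    return [d * q for d, qs in blocks for q in qs]


def bits_per_weight(value_bits, scale_bits=16, block=BLOCK):
    """Effective storage cost once the per-block scale is amortized."""
    return (block * value_bits + scale_bits) / block
```

Under these assumptions, the per-block scale adds half a bit per weight, so nominal 8-bit and 4-bit layouts of this shape cost 8.5 and 4.5 bits per weight respectively.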
GPTQ
Post-training quantization method that compresses model weights to 4-bit or 3-bit precision, one layer at a time, using approximate second-order information computed from a small calibration set. No retraining is required. Enables running large models on consumer GPUs.
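The role of the second-order information can be sketched for a single weight row: each weight is quantized in turn, and the resulting error is pushed onto the not-yet-quantized weights using the inverse of a Hessian built from calibration inputs. This is a small, unoptimized illustration under stated assumptions, not the published implementation, which works on whole matrices with a Cholesky factorization and blocking. All function names and the damping value are hypothetical.

```python
def invert(matrix):
    """Gauss-Jordan inverse with partial pivoting for a small square matrix."""
    n = len(matrix)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [a - f * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def snap(v, scale, qmax):
    """Round one value to the nearest point of a symmetric integer grid."""
    return max(-qmax, min(qmax, round(v / scale))) * scale


def rtn_row(w, bits=4):
    """Baseline: independent round-to-nearest for every weight."""
    qmax = 2 ** (bits - 1) - 1
    scale = (max(abs(v) for v in w) / qmax) or 1.0
    return [snap(v, scale, qmax) for v in w]


def gptq_like_row(w, calib, bits=4, damp=0.01):
    """Quantize weights one at a time, compensating later weights for each error."""
    n = len(w)
    # Hessian proxy H = X^T X over calibration inputs, with diagonal damping.
    H = [[sum(x[i] * x[j] for x in calib) for j in range(n)] for i in range(n)]
    mean_diag = sum(H[i][i] for i in range(n)) / n
    for i in range(n):
        H[i][i] += damp * mean_diag
    Hinv = invert(H)
    qmax = 2 ** (bits - 1) - 1
    scale = (max(abs(v) for v in w) / qmax) or 1.0
    w = list(w)
    q = [0.0] * n
    for i in range(n):
        q[i] = snap(w[i], scale, qmax)
        err = (w[i] - q[i]) / Hinv[i][i]
        for j in range(i + 1, n):
            w[j] -= err * Hinv[i][j]       # spread the error onto remaining weights
        # Remove coordinate i from the inverse Hessian before the next step.
        d, row_i = Hinv[i][i], Hinv[i][:]
        for j in range(i + 1, n):
            f = Hinv[j][i] / d
            for k in range(i + 1, n):
                Hinv[j][k] -= f * row_i[k]
    return q
```

When the calibration inputs are uncorrelated, the Hessian is diagonal and the sketch reduces to round-to-nearest; the compensation only matters when input channels are correlated.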
PagedAttention
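Virtual memory-inspired KV cache management that stores each sequence's cache in fixed-size blocks that need not be contiguous in memory. This eliminates external fragmentation and limits waste to the last, partially filled block of each sequence. The memory saved allows larger batch sizes and higher throughput.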
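The bookkeeping behind this can be sketched with a block table that maps each sequence to a list of physical blocks. It is a toy allocator for illustration only: the class and method names are hypothetical, and it tracks slot positions rather than storing any key or value tensors.

```python
class PagedKVCache:
    """Toy block allocator: sequences own lists of fixed-size physical blocks."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}   # seq_id -> list of physical block ids
        self.lengths = {}        # seq_id -> number of tokens stored

    def append_token(self, seq_id):
        """Reserve a slot for one new token; returns (physical_block, offset)."""
        table = self.block_tables.setdefault(seq_id, [])
        length = self.lengths.get(seq_id, 0)
        if length == len(table) * self.block_size:
            # No blocks yet, or the last block is full: take any free block.
            if not self.free_blocks:
                raise MemoryError("no free KV blocks")
            table.append(self.free_blocks.pop())
        self.lengths[seq_id] = length + 1
        return table[length // self.block_size], length % self.block_size

    def free(self, seq_id):
        """Return all blocks of a finished sequence to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)

    def wasted_slots(self):
        """Reserved but unused slots; at most block_size - 1 per sequence."""
        return sum(len(table) * self.block_size - self.lengths[seq]
                   for seq, table in self.block_tables.items())
```

Because a block is allocated only when the previous one fills up, no sequence reserves memory for its maximum possible length in advance, and freed blocks can be reused by any other sequence.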
Speculative Decoding
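Uses a smaller draft model to generate candidate tokens that the larger model verifies in parallel. With the standard accept-or-resample rule, the output follows the same distribution as sampling from the larger model alone, so there is no quality loss. Reported speedups are typically around 2-3x and depend on how often the draft tokens are accepted.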
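The accept-or-resample rule can be sketched for a single token position. Here `p` stands for the larger model's next-token distribution and `q` for the draft model's; both are plain probability lists invented for illustration, and the function names are hypothetical. A full system would draft several tokens and verify them in one forward pass, which this sketch leaves out.

```python
import random


def residual_distribution(p, q):
    """Normalized max(0, p - q): where to resample after a rejected draft token."""
    r = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    total = sum(r)
    return [ri / total for ri in r] if total > 0 else list(p)


def speculative_step(p, q, rng=random):
    """Draft one token from q, then accept it or resample so the result follows p."""
    x = rng.choices(range(len(q)), weights=q)[0]
    if rng.random() < min(1.0, p[x] / q[x]):
        return x, True
    x = rng.choices(range(len(p)), weights=residual_distribution(p, q))[0]
    return x, False


def output_distribution(p, q):
    """Exact distribution of the token returned by speculative_step."""
    accept = [min(pi, qi) for pi, qi in zip(p, q)]   # q(x) * min(1, p(x) / q(x))
    reject_mass = 1.0 - sum(accept)
    residual = residual_distribution(p, q)
    return [a + reject_mass * r for a, r in zip(accept, residual)]


def acceptance_rate(p, q):
    """Probability that a drafted token is accepted; higher means more speedup."""
    return sum(min(pi, qi) for pi, qi in zip(p, q))
```

`output_distribution(p, q)` works out to `p` for any draft distribution, which is why the method is lossless. The draft model only affects how often tokens are accepted, and therefore how much time is saved.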