Research Papers

The foundational and breakthrough papers driving the open-source LLM revolution.

Mixtral of Experts

Jiang et al. (Mistral AI) · 2024-01-08

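Sparse Mixture-of-Experts model in which each layer has 8 expert feed-forward networks and a router selects 2 of them per token (top-2 routing). Matches or outperforms Llama 2 70B on the reported benchmarks while using only about 13B active parameters per token, out of roughly 47B total.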

MoE · sparse · efficient · Mistral
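
To make top-2 routing concrete, here is a minimal pure-Python sketch of a single token passing through one sparse layer. The function names (`top2_route`, `moe_layer`) and the toy experts are illustrative assumptions, not code from the paper; real experts are feed-forward networks and the router logits come from a learned linear layer.

```python
import math

def top2_route(router_logits):
    """Return [(expert_index, weight), (expert_index, weight)] for the top-2 experts."""
    # Indices of the two largest router logits (ties keep the lower index first).
    top2 = sorted(range(len(router_logits)),
                  key=lambda i: router_logits[i], reverse=True)[:2]
    # Softmax over the two selected logits only, so the gate weights sum to 1.
    m = max(router_logits[i] for i in top2)
    exps = [math.exp(router_logits[i] - m) for i in top2]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top2, exps)]

def moe_layer(x, router_logits, experts):
    """Weighted sum of the two selected experts' outputs; other experts never run."""
    out = [0.0] * len(x)
    for idx, weight in top2_route(router_logits):
        y = experts[idx](x)
        out = [o + weight * yi for o, yi in zip(out, y)]
    return out

# Eight toy "experts" (hypothetical): expert k scales its input by (k + 1).
experts = [lambda v, k=k: [(k + 1) * vi for vi in v] for k in range(8)]
logits = [0.1, 2.0, -1.0, 0.3, 2.0, 0.0, -0.5, 1.0]

routes = top2_route(logits)
y = moe_layer([1.0, -1.0], logits, experts)
```

Only two of the eight experts are evaluated per token, which is why the active parameter count is far below the total.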

Efficient Memory Management for Large Language Model Serving with PagedAttention

Kwon et al. (UC Berkeley) · 2023-09-12

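Applies virtual-memory paging concepts to KV cache management in LLM serving. PagedAttention stores each sequence's KV cache in fixed-size blocks that need not be contiguous, which nearly eliminates memory fragmentation. The resulting system, vLLM, improves throughput by 2-4x over prior serving systems at comparable latency.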

serving · memory · KV-cache · vLLM
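
The paging idea can be sketched as a toy block allocator: each sequence keeps a block table mapping its logical KV blocks to physical blocks drawn from a shared pool. The class and method names below (`PagedKVCache`, `append_token`, `free`) are hypothetical and are not vLLM's API; the sketch tracks only slot bookkeeping, not the key/value tensors themselves.

```python
class PagedKVCache:
    """Toy allocator: sequences map logical KV blocks to physical blocks."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))  # pool of physical block ids
        self.block_tables = {}                      # seq_id -> [physical block ids]
        self.seq_lens = {}                          # seq_id -> tokens stored so far

    def append_token(self, seq_id):
        """Reserve a slot for one more token; return (physical_block, offset)."""
        table = self.block_tables.setdefault(seq_id, [])
        n = self.seq_lens.get(seq_id, 0)
        if n % self.block_size == 0:  # no block yet, or the last block is full
            if not self.free_blocks:
                raise MemoryError("no free KV blocks")
            table.append(self.free_blocks.pop(0))
        self.seq_lens[seq_id] = n + 1
        return table[n // self.block_size], n % self.block_size

    def free(self, seq_id):
        """Return a finished sequence's blocks to the shared pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)

cache = PagedKVCache(num_blocks=4, block_size=2)
slots_a = [cache.append_token("a") for _ in range(3)]  # spans two blocks
slot_b = cache.append_token("b")                        # gets the next free block
cache.free("a")                                         # blocks 0 and 1 are reusable
```

Because blocks are allocated on demand, a sequence wastes at most one partially filled block rather than a large preallocated contiguous region.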

LLaMA: Open and Efficient Foundation Language Models

Touvron et al. (Meta) · 2023-02-27

Demonstrated that smaller models trained on more tokens can match much larger ones: LLaMA-13B outperforms GPT-3 (175B) on most benchmarks. Catalyzed the open-source LLM movement by releasing model weights at sizes from 7B to 65B parameters to the research community.

open-source · scaling-laws · Meta

FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness

Dao et al. · 2022-05-27

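Proposed an IO-aware exact attention algorithm that uses tiling to substantially reduce reads and writes between GPU high-bandwidth memory (HBM) and on-chip SRAM. The paper reports wall-clock training speedups of 3x on GPT-2 and 2.4x on Long Range Arena, and enables longer sequences without approximation.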

attention · optimization · memory-efficient
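
The arithmetic that makes tiling exact is a running ("online") softmax: keys and values are processed one tile at a time while a running maximum, normalizer, and weighted sum are rescaled as needed. The sketch below shows only that arithmetic for a single query in pure Python; it does not model GPU memory, and the function names and tile size are illustrative assumptions rather than the paper's kernel.

```python
import math

def attention_naive(q, keys, values):
    """Reference: softmax(q.K^T / sqrt(d)) V, materializing all scores at once."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)
    w = [math.exp(s - m) for s in scores]
    z = sum(w)
    return [sum(wi * v[j] for wi, v in zip(w, values)) / z
            for j in range(len(values[0]))]

def attention_tiled(q, keys, values, tile=2):
    """Same result, visiting keys/values tile by tile with a running softmax."""
    scale = 1.0 / math.sqrt(len(q))
    m = -math.inf                    # running max of the scores seen so far
    z = 0.0                          # running softmax normalizer
    acc = [0.0] * len(values[0])     # running weighted sum of values
    for start in range(0, len(keys), tile):
        ks = keys[start:start + tile]
        vs = values[start:start + tile]
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in ks]
        m_new = max(m, max(scores))
        rescale = math.exp(m - m_new)  # shrink old statistics to the new max
        z *= rescale
        acc = [a * rescale for a in acc]
        for s, v in zip(scores, vs):
            p = math.exp(s - m_new)
            z += p
            acc = [a + p * vj for a, vj in zip(acc, v)]
        m = m_new
    return [a / z for a in acc]

q = [0.5, -1.0, 2.0]
keys = [[1.0, 0.0, 1.0], [0.0, 2.0, -1.0], [3.0, 1.0, 0.5],
        [-1.0, 0.5, 0.0], [0.2, 0.2, 0.2]]
values = [[1.0, 2.0], [0.0, -1.0], [4.0, 0.5], [2.0, 2.0], [-3.0, 1.0]]

ref = attention_naive(q, keys, values)
out = attention_tiled(q, keys, values, tile=2)
```

The tiled version never holds more than one tile of scores at a time, yet agrees with the reference up to floating-point rounding.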

LoRA: Low-Rank Adaptation of Large Language Models

Hu et al. (Microsoft) · 2021-06-17

Introduced low-rank decomposition for parameter-efficient fine-tuning. Freezes the pretrained weights and injects trainable rank-decomposition matrices into each layer, reducing the number of trainable parameters by up to 10,000x compared with full fine-tuning of GPT-3 175B.

fine-tuning · parameter-efficient · adaptation
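
As a rough illustration of the update, a LoRA layer computes h = W x + (alpha / r) · B A x, where W is frozen and only the small matrices A (r × k) and B (d × r) are trained. The sketch below uses plain Python lists; the helper names and the tiny dimensions are assumptions chosen for readability, not taken from any LoRA library.

```python
def matvec(M, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]

def lora_forward(W, A, B, x, alpha):
    """Frozen base output plus the scaled low-rank update (alpha / r) * B(Ax)."""
    r = len(A)                        # rank = number of rows of A
    base = matvec(W, x)               # frozen pretrained path
    delta = matvec(B, matvec(A, x))   # trainable low-rank path
    return [b + (alpha / r) * d for b, d in zip(base, delta)]

# Toy sizes (hypothetical): d = 3 outputs, k = 4 inputs, rank r = 1.
W = [[1.0, 0.0, 0.0, 0.0],
     [0.0, 1.0, 0.0, 0.0],
     [0.0, 0.0, 1.0, 0.0]]
A = [[0.5, -0.5, 1.0, 0.0]]           # r x k
B_init = [[0.0], [0.0], [0.0]]        # d x r, zero-initialized as in the paper
B_trained = [[1.0], [0.0], [2.0]]     # made-up values standing in for training
x = [1.0, 2.0, 3.0, 4.0]

h_start = lora_forward(W, A, B_init, x, alpha=2.0)     # equals W x
h_tuned = lora_forward(W, A, B_trained, x, alpha=2.0)

full_params = len(W) * len(W[0])                        # d * k
lora_params = len(A) * len(A[0]) + len(B_init) * len(B_init[0])  # r * (k + d)
```

With B initialized to zero the adapted layer starts out identical to the pretrained one, and the trainable parameter count grows with r · (d + k) instead of d · k.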

Attention Is All You Need

Vaswani et al. · 2017-06-12

Introduced the Transformer architecture based entirely on self-attention mechanisms, replacing recurrence and convolutions. It is the foundational architecture behind nearly all modern LLMs.

transformer · attention · foundational
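
For reference, the paper's core operation is scaled dot-product attention, written here with queries Q, keys K, values V, and key dimension d_k:

```latex
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V
```

Dividing by the square root of d_k keeps the dot products from growing with dimension and pushing the softmax into regions with very small gradients.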