What Happened
In September 2023, the PagedAttention paper was posted to arXiv, describing KV-cache memory issues in batched LLM serving and proposing paging-inspired management to reduce fragmentation and waste.
Why It Matters
The work contributed to a wave of inference engineering research that treats LLM serving as a systems problem, not just a modeling problem—supporting broader adoption by lowering cost per token.
Technical Details
PagedAttention draws inspiration from OS virtual memory: it partitions each sequence's KV cache into fixed-size blocks ("pages") that need not be contiguous in GPU memory and are allocated on demand as tokens are generated. This bounds internal fragmentation to at most one partially filled block per sequence and keeps allocation flexible under dynamic, unpredictable sequence lengths.
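The block-allocation idea can be sketched in a few lines. This is a minimal illustration with hypothetical names (BlockAllocator, Sequence, BLOCK_SIZE), not the actual vLLM implementation: a free list hands out fixed-size physical blocks, and each sequence keeps a block table mapping logical token positions to physical block IDs, allocating a new block only when the current one fills.

```python
BLOCK_SIZE = 16  # tokens per KV-cache block (illustrative value)

class BlockAllocator:
    """Hands out fixed-size physical blocks from a shared free list."""
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))

    def alloc(self):
        if not self.free:
            raise MemoryError("KV cache exhausted")
        return self.free.pop()

    def release(self, blocks):
        self.free.extend(blocks)

class Sequence:
    """Maps a sequence's logical token positions to physical blocks."""
    def __init__(self, allocator):
        self.allocator = allocator
        self.block_table = []  # logical block index -> physical block id
        self.num_tokens = 0

    def append_token(self):
        # Allocate a new block only when the current one is full, so at
        # most BLOCK_SIZE - 1 slots are ever wasted per sequence.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(self.allocator.alloc())
        self.num_tokens += 1

allocator = BlockAllocator(num_blocks=8)
seq = Sequence(allocator)
for _ in range(20):  # 20 tokens -> ceil(20/16) = 2 blocks
    seq.append_token()
print(len(seq.block_table), len(allocator.free))  # -> 2 6
```

In a real serving engine each physical block holds key/value tensors on the GPU and the attention kernel reads through the block table; the sketch above only models the bookkeeping that keeps fragmentation low.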