What Happened
In May 2022, the FlashAttention paper presented an exact attention computation approach designed to reduce memory reads and writes between levels of the GPU memory hierarchy, chiefly between high-bandwidth memory (HBM) and fast on-chip SRAM.
Why It Matters
Efficiency techniques like FlashAttention helped make training and inference for large transformer models more practical, especially as context lengths and model sizes increased.
Technical Details
FlashAttention uses tiling and IO-aware scheduling to reduce HBM traffic: queries, keys, and values are processed block by block in on-chip SRAM, with a running (online) softmax that lets the full N×N attention matrix never be materialized. This improves throughput while preserving exact attention semantics, since the final output is mathematically identical to standard attention.
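The core numerical trick — computing exact softmax attention one key/value block at a time with running statistics — can be sketched in plain NumPy. This is a minimal illustration of the online-softmax idea, not the paper's CUDA kernel: the `block_size` value, single-head layout, and function name are illustrative assumptions.

```python
import numpy as np

def tiled_attention(Q, K, V, block_size=2):
    """Exact attention computed block-by-block over K/V using a running
    (online) softmax, so the full N x N score matrix is never formed.
    A simplified sketch of the idea behind FlashAttention; the real
    kernel tiles Q as well and runs in on-chip SRAM."""
    n, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    out = np.zeros_like(Q, dtype=np.float64)  # unnormalized output accumulator
    row_max = np.full(n, -np.inf)             # running max score per query row
    row_sum = np.zeros(n)                     # running softmax denominator per row
    for start in range(0, K.shape[0], block_size):
        Kb = K[start:start + block_size]
        Vb = V[start:start + block_size]
        scores = (Q @ Kb.T) * scale           # (n, block) tile of scores
        new_max = np.maximum(row_max, scores.max(axis=1))
        # rescale previous accumulators to the new running max
        correction = np.exp(row_max - new_max)
        p = np.exp(scores - new_max[:, None])
        row_sum = row_sum * correction + p.sum(axis=1)
        out = out * correction[:, None] + p @ Vb
        row_max = new_max
    return out / row_sum[:, None]

# Sanity check against naive attention that materializes the full matrix.
rng = np.random.default_rng(0)
Q = rng.standard_normal((5, 4))
K = rng.standard_normal((5, 4))
V = rng.standard_normal((5, 4))
scores = (Q @ K.T) / np.sqrt(4)
weights = np.exp(scores - scores.max(axis=1, keepdims=True))
naive = (weights / weights.sum(axis=1, keepdims=True)) @ V
assert np.allclose(tiled_attention(Q, K, V), naive)
```

Because the per-row max and denominator are carried forward and earlier accumulators are rescaled whenever the max grows, each block's work stays numerically stable and the final result matches standard attention exactly, which is why the technique is called "exact" rather than an approximation.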