What Happened
In May 2022, the FlashAttention paper presented an exact attention computation approach designed to reduce memory reads and writes between levels of the GPU memory hierarchy, chiefly between high-bandwidth memory (HBM) and fast on-chip SRAM.
Why It Matters
Efficiency techniques like FlashAttention helped make training and inference for large transformer models more practical, especially as context lengths and model sizes increased.
Technical Details
FlashAttention uses tiling and IO-aware scheduling to reduce HBM traffic: queries, keys, and values are processed block by block in on-chip SRAM, with a running (online) softmax that lets the full N×N attention matrix never be materialized. This improves throughput while preserving exact attention semantics, since the final output is mathematically identical to standard attention.
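The core numerical trick — computing exact softmax attention one key/value block at a time with running statistics — can be sketched in plain NumPy. This is a minimal illustration of the online-softmax idea, not the paper's CUDA kernel: the `block_size` value, single-head layout, and function name are illustrative assumptions.

```python
import numpy as np

def tiled_attention(Q, K, V, block_size=2):
    """Exact attention computed block-by-block over K/V using a running
    (online) softmax, so the full N x N score matrix is never formed.
    A simplified sketch of the idea behind FlashAttention; the real
    kernel tiles Q as well and runs in on-chip SRAM."""
    n, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    out = np.zeros_like(Q, dtype=np.float64)  # unnormalized output accumulator
    row_max = np.full(n, -np.inf)             # running max score per query row
    row_sum = np.zeros(n)                     # running softmax denominator per row
    for start in range(0, K.shape[0], block_size):
        Kb = K[start:start + block_size]
        Vb = V[start:start + block_size]
        scores = (Q @ Kb.T) * scale           # (n, block) tile of scores
        new_max = np.maximum(row_max, scores.max(axis=1))
        # rescale previous accumulators to the new running max
        correction = np.exp(row_max - new_max)
        p = np.exp(scores - new_max[:, None])
        row_sum = row_sum * correction + p.sum(axis=1)
        out = out * correction[:, None] + p @ Vb
        row_max = new_max
    return out / row_sum[:, None]

# Sanity check against naive attention that materializes the full matrix.
rng = np.random.default_rng(0)
Q = rng.standard_normal((5, 4))
K = rng.standard_normal((5, 4))
V = rng.standard_normal((5, 4))
scores = (Q @ K.T) / np.sqrt(4)
weights = np.exp(scores - scores.max(axis=1, keepdims=True))
naive = (weights / weights.sum(axis=1, keepdims=True)) @ V
assert np.allclose(tiled_attention(Q, K, V), naive)
```

Because the per-row max and denominator are carried forward and earlier accumulators are rescaled whenever the max grows, each block's work stays numerically stable and the final result matches standard attention exactly, which is why the technique is called "exact" rather than an approximation.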