MAC-Attention: a Match-Amend-Complete Scheme for Fast and Accurate Attention Computation

Jinghan Yao, Sam Adé Jacobs, Walid Krichene, Masahiro Tanaka, Dhabaleswar K Panda

Abstract

Long-context decoding in LLMs is IO-bound: each token re-reads an ever-growing KV cache. Prior accelerations cut the bytes read either via compression, which lowers fidelity, or via selection/eviction, which restricts what remains accessible; both can degrade delayed recall and long-form generation. We introduce MAC-Attention, a fidelity- and access-preserving alternative that accelerates decoding by reusing prior attention computations for semantically similar recent queries. A match stage performs pre-RoPE L2 matching over a short local window; an amend stage rectifies the reused attention by recomputing a small band near the match boundary; and a complete stage fuses the rectified result with fresh attention computed on the KV tail through a numerically stable merge. On a match hit, the compute and bandwidth cost is constant regardless of context length. The method is model-agnostic and composes with IO-aware kernels, paged-KV managers, and MQA/GQA. Across LongBench v2 (120K), RULER (120K), and LongGenBench (16K continuous generation), and compared to the latest FlashInfer library, MAC-Attention reduces KV accesses by up to 99%, cuts token-generation latency by over 60% at 128K, and achieves over 14.3x attention-phase speedups and up to 2.6x end-to-end, while maintaining full-attention quality. By reusing computation, MAC-Attention delivers long-context inference that is both fast and faithful. Code is available at https://github.com/YJHMITWEB/MAC-Attention.git
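
To make the match step concrete, here is a minimal Python sketch, under stated assumptions, of the pre-RoPE L2 matching the abstract describes: the current pre-RoPE query is compared against a short ring of recent queries, and a reuse candidate is accepted only if its L2 distance falls below a threshold. The names (`match_query`, `ring`, `RING_SIZE`, `l2_threshold`) are illustrative and are not the repository's API.

```python
# Illustrative sketch of the "match" stage: compare the current pre-RoPE query
# against a short ring (local window) of recent pre-RoPE queries and accept the
# nearest one only if it passes an L2 threshold; otherwise fall back to full attention.
# All names here are hypothetical, not MAC-Attention's actual interface.
from collections import deque
import numpy as np

RING_SIZE = 8  # assumed size of the local window of recent queries

def match_query(q_pre_rope: np.ndarray, ring: deque, l2_threshold: float):
    """Return the ring slot of the closest recent query, or None on a miss."""
    if not ring:
        return None  # nothing to reuse yet -> run full attention
    dists = [np.linalg.norm(q_pre_rope - q_prev) for q_prev in ring]
    best = int(np.argmin(dists))
    return best if dists[best] <= l2_threshold else None

# After each decoding step, the new pre-RoPE query would be pushed into the ring
# so that later, semantically similar queries can find it.
ring = deque(maxlen=RING_SIZE)
```

On a hit, the matched entry's attention summary over the shared prefix is reused; on a miss, the step degenerates to ordinary full attention.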

Paper Structure

This paper contains 60 sections, 13 equations, 10 figures, and 8 tables.

Figures (10)

  • Figure 1: Accuracy vs. KV budget (the fraction of KV cache used in each decoding step) across long‑context benchmarks. Top: LongBench v2 (up to 120K context) [bai2024longbenchv2]. Bottom left: LongGenBench (up to 16K continuous generation) [wu2024longgenbench]. Bottom right: RULER (120K context) [hsieh2024ruler]. MAC‑Attention is highlighted (dark pink); full attention is shown as a gray dashed line.
  • Figure 2: MAC‑Attention: Match–Amend–Complete with async cache update. (a) Match: at position $m$, compare the pre‑RoPE query $\tilde{Q}_m$ against a small ring of recent queries; if the nearest match $p$ passes an L2 threshold, fetch its attention summary $\mathrm{AS}_{1:p-r}$ (otherwise run full attention). (b) Complete (critical path): with the post‑RoPE query $q_m$, compute attention only on the band+tail $[\,p{-}r{+}1,\,m\,]$ and log‑domain merge $\mathrm{AS}_{1:p-r} \oplus \mathrm{AS}_{p-r+1:m}$; the KV $[1,p-r]$ is not accessed. (c) Amend (async): compute a rectification term $\mathrm{AS}_{r+1:m}$ and update the cache via online subtraction to obtain $\mathrm{AS}_{m-r}$; insert $\tilde{q}_m, \mathrm{AS}_{m-r}$ into the rings. Symbols: $p$—match position, $r$—band width; $\oplus$—log‑domain merge, $\ominus$—log‑domain removal. Shaded regions denote KV segments not re‑read under reuse. (A minimal sketch of the $\oplus$/$\ominus$ operations follows this figure list.)
  • Figure 3: Decode micro‑pipeline for MAC-Attention. (a): L2 matching over per‑request rings with the SplitQ design; (b): per‑head work spans differ because the reuse point $p$ and band size $r$ vary; the schedule assigns more CTAs to longer band+tail spans (green blocks), and many heads perform only a cheap merge when reuse is strong; (c): overview of the MAC-Attention workflow.
  • Figure 4: Rectification error vs. reuse gap and band width. Layerwise heatmaps show the normalized output error as a function of reuse gap ($\Delta$) and rectification band width ($r$); representative layers are displayed.
  • Figure 5: Layerwise reuse patterns. Acceptance rate (left) and skipped‑prefix fraction (right) per layer. Top: threshold sweep at fixed window size. Bottom: window‑size sweep at fixed threshold.
  • ...and 5 more figures
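
The $\oplus$ (merge) and $\ominus$ (removal) operations referenced in the Figure 2 caption can be sketched as the standard numerically stable combination of partial softmax statistics. The sketch below assumes an attention summary is a triple $(m, \ell, o)$: the max logit, the normalizer $\ell = \sum_i e^{s_i - m}$, and the unnormalized value accumulator $o = \sum_i e^{s_i - m} v_i$; the function names and this exact representation are assumptions, not the paper's definitions.

```python
# Hedged sketch of log-domain merge (⊕) and removal (⊖) over attention summaries,
# assuming each summary is (m, l, o) = (max logit, softmax normalizer, value accumulator).
import numpy as np

def merge(as_a, as_b):
    """AS_a ⊕ AS_b: combine summaries over two disjoint key ranges, rescaled to a common max."""
    m_a, l_a, o_a = as_a
    m_b, l_b, o_b = as_b
    m = max(m_a, m_b)
    s_a, s_b = np.exp(m_a - m), np.exp(m_b - m)
    return m, l_a * s_a + l_b * s_b, o_a * s_a + o_b * s_b

def remove(as_all, as_band):
    """AS_all ⊖ AS_band: subtract a band's contribution (as in the async amend/update step).

    Assumes the band is a subset of AS_all's key range, so m_all >= m_band.
    """
    m_all, l_all, o_all = as_all
    m_b, l_b, o_b = as_band
    s_b = np.exp(m_b - m_all)  # rescale the band's statistics to AS_all's max
    return m_all, l_all - l_b * s_b, o_all - o_b * s_b
```

The attention output for any range is recovered as $o/\ell$ after merging, so summaries can be fused with fresh band+tail attention without revisiting the skipped KV prefix.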