Compressed Convolutional Attention: Efficient Attention in a Compressed Latent Space

Tomas Figliolia, Nicholas Alonso, Rishi Iyer, Quentin Anthony, Beren Millidge

TL;DR

This work tackles the computational bottleneck of self-attention by introducing Compressed Convolutional Attention (CCA), which down-projects queries, keys, and values into a shared latent space and performs the full attention operation there. Combining CCA with Grouped Query Attention (GQA) yields Compressed Convolutional Grouped Query Attention (CCGQA), which substantially reduces KV-cache, parameters, and FLOPs while maintaining or improving accuracy on dense and mixture-of-experts (MoE) models. Empirically, CCA and CCGQA outperform GQA and MLA at equal KV-cache compression and, in MoE settings, allow KV-cache reductions of up to 8x with no loss relative to standard MHA. A fused kernel on H100 GPUs reduces prefill latency by about 1.7x at a sequence length of 16k relative to MHA and accelerates the backward pass by about 1.3x. The results indicate that parameter sharing (as in GQA) and parameter compression (as in CCA) are orthogonal and can be combined, giving a flexible Pareto frontier for trading compute against memory under different deployment constraints and scales. Unlike MLA, CCA applies RoPE directly within the latent space.

Abstract

Multi-headed Attention's (MHA) quadratic compute and linearly growing KV-cache make long-context transformers expensive to train and serve. Prior works such as Grouped Query Attention (GQA) and Multi-Latent Attention (MLA) shrink the cache, speeding decode, but leave compute, which determines prefill and training speed, largely unchanged. We introduce Compressed Convolutional Attention (CCA), a novel attention method which down-projects queries, keys, and values and performs the entire attention operation inside the shared latent space. This simple design dramatically cuts parameters, KV-cache, and FLOPs all at once by the desired compression factor. Because CCA is orthogonal to head-sharing, we combine the two to form Compressed Convolutional Grouped Query Attention (CCGQA), which further tightens the compute-bandwidth Pareto frontier so that users can tune compression toward either FLOP or memory limits without sacrificing quality. Experiments show that CCGQA consistently outperforms both GQA and MLA at equal KV-cache compression on dense and MoE models. Additionally, we show that CCGQA outperforms all other attention methods on MoE models with half the KV-cache of GQA and MLA, achieving an 8x KV-cache compression with no drop in performance compared to standard MHA. CCA and CCGQA also dramatically reduce the FLOP cost of attention which leads to substantially faster training and prefill than existing methods. On H100 GPUs, our fused CCA/CCGQA kernel reduces prefill latency by about 1.7x at a sequence length of 16k relative to MHA, and accelerates backward by about 1.3x.
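The core idea in the abstract, attention computed entirely in a down-projected latent space, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: it omits the convolutions, QK-mean, v-shift, normalization, and RoPE, and it assumes a learned output projection back to the model width. The names `latent_attention`, `w_q`, `w_k`, `w_v`, `w_o`, and the toy sizes are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax; masked entries (-inf) become exactly 0.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def latent_attention(x, w_q, w_k, w_v, w_o, n_heads):
    """Causal multi-head attention computed in a compressed latent space.

    x:             (S, E) input sequence
    w_q, w_k, w_v: (E, E_lat) down-projections, with E_lat = E / C
    w_o:           (E_lat, E) output projection (an assumption of this sketch)
    Returns the output (S, E) plus the latent K and V that would be cached.
    """
    seq_len = x.shape[0]
    e_lat = w_q.shape[1]
    d_head = e_lat // n_heads

    def split_heads(t):
        # (S, E_lat) -> (H, S, d_head)
        return t.reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)

    # Down-project once; everything after this line has width E_lat, not E.
    q, k, v = split_heads(x @ w_q), split_heads(x @ w_k), split_heads(x @ w_v)

    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)      # (H, S, S)
    future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)                # causal mask
    ctx = softmax(scores) @ v                                 # (H, S, d_head)

    ctx = ctx.transpose(1, 0, 2).reshape(seq_len, e_lat)      # merge heads
    return ctx @ w_o, k, v

# Toy usage: E = 64 compressed by C = 4 to a latent width of 16.
rng = np.random.default_rng(0)
S, E, C, H = 8, 64, 4, 2
E_lat = E // C
x = rng.standard_normal((S, E))
w_q, w_k, w_v = (rng.standard_normal((E, E_lat)) / np.sqrt(E) for _ in range(3))
w_o = rng.standard_normal((E_lat, E)) / np.sqrt(E_lat)
y, k_cache, v_cache = latent_attention(x, w_q, w_k, w_v, w_o, H)
```

Because the score and value matmuls run at width `E_lat`, their cost and the size of the cached `k` and `v` both shrink by the factor `C` in this sketch, which is the effect the abstract describes.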

Paper Structure

This paper contains 14 sections, 12 equations, 18 figures, 5 tables.

Figures (18)

  • Figure 1: Diagram of the operations involved in the CCA block. This diagram describes the computation of the compressed latent query, key, and value vectors prior to performing standard Flash Attention on the compressed latents. The input $x$ is first down-projected using the $\tilde{W}_Q, \tilde{W}_K, \tilde{W}_V$ matrices, then the two convolution operations are performed followed by the QK-mean operation, then normalization. For the V matrix, we do not apply any convolutions, but instead apply the v-shift operation.
  • Figure 2: Theoretical computational and memory complexity analysis across attention mechanisms with $E=2048$. (a) Parameter counts show that compression methods reduce model size. (b) KV-cache memory at long context demonstrates substantial savings. (c) Prefill and (d) decode FLOPs exhibit quadratic and linear scaling with sequence length, respectively. Note that these are theoretical FLOP counts. See empirical latency measurements in Figures \ref{fig:fwd-nocausal-64}-\ref{fig:bwd-256}. Our kernel implementation will be further improved through better operator fusion and memory access patterns. See Appendix \ref{sec:appendix-mla-inference} for more details on MLA inference considerations.
  • Figure 3: Comparison of perplexity on the Zyda2 dataset for 1B parameter dense transformer models trained for 300B tokens with different attention mechanisms. CCA beats MLA in the parameter-matched setting with fewer FLOPs. When matching CCA FLOPs to GQA and MHA via CCGQA, we see a substantial improvement in perplexity.
  • Figure 4: Comparison of perplexity on 50B tokens of the Zyda2 dataset for 350M/1.5B parameter proprietary MoE models with different attention mechanisms. Our proposed methods, CCA and CCGQA, achieve lower loss than GQA and MLA at equivalent parameter counts with lower compute cost and, in the comparison with MLA, fewer training parameters.
  • Figure 5: Performance of CCA versus competing attention methods with hidden dimension 2048 and BFLOAT16 on an H100 GPU.
  • ...and 13 more figures
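The scaling behaviour summarized in the Figure 2 caption can be reproduced with back-of-the-envelope arithmetic. The sketch below is a rough cost model under stated assumptions, not the paper's accounting: it treats Q, K, and V as all having width `embed_dim / compression`, counts one multiply-add as 2 FLOPs, ignores causal masking, and leaves out the output projection, the convolutions, and any MLA-specific terms. The function name `attention_costs` and the compression factor of 8 used in the example are illustrative choices.

```python
def attention_costs(seq_len, embed_dim, compression=1, bytes_per_elem=2):
    """Rough per-layer attention costs at latent width embed_dim / compression."""
    width = embed_dim // compression
    return {
        # Three projection matrices of shape (embed_dim, width).
        "qkv_params": 3 * embed_dim * width,
        # One K and one V vector of size `width` cached per token.
        "kv_cache_bytes": 2 * seq_len * width * bytes_per_elem,
        # QK^T and AV over the whole sequence: quadratic in seq_len.
        "prefill_flops": 4 * seq_len * seq_len * width,
        # One new query against the cache: linear in seq_len.
        "decode_flops_per_token": 4 * seq_len * width,
    }

# E = 2048 as in Figure 2, a 16k context, and BF16 (2 bytes per element).
mha = attention_costs(seq_len=16384, embed_dim=2048)
compressed = attention_costs(seq_len=16384, embed_dim=2048, compression=8)

for name in mha:
    print(f"{name}: {mha[name] / compressed[name]:.0f}x smaller")
```

Under these assumptions every quantity shrinks by the same factor of 8, and the uncompressed KV-cache at 16k tokens comes to 128 MiB per layer. Measured latency need not follow these ratios, since the caption itself notes that the plotted values are theoretical FLOP counts.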