
TriAttention: Efficient Long Reasoning with Trigonometric KV Compression

Weian Mao, Xi Lin, Wei Huang, Yuxin Xie, Tianfu Fu, Bohan Zhuang, Song Han, Yukang Chen

Abstract

Extended reasoning in large language models (LLMs) creates severe KV cache memory bottlenecks. Leading KV cache compression methods estimate KV importance using attention scores from recent post-RoPE queries. However, queries rotate with position under RoPE, so only a few recent queries are representative, which leads to poor top-key selection and unstable reasoning. To avoid this issue, we turn to the pre-RoPE space, where we observe that Q and K vectors are highly concentrated around fixed non-zero centers and remain stable across positions; we call this Q/K concentration. We show that this concentration causes queries to preferentially attend to keys at specific distances (e.g., the nearest keys), with the centers determining which distances are preferred via a trigonometric series. Based on this, we propose TriAttention, which estimates key importance from these centers: via the trigonometric series, the distance preference characterized by the centers scores keys according to their positions, and Q/K norms serve as an additional importance signal. On AIME25 with 32K-token generation, TriAttention matches Full Attention reasoning accuracy while achieving 2.5x higher throughput or 10.7x KV memory reduction, whereas leading baselines reach only about half the accuracy at the same efficiency. TriAttention enables OpenClaw deployment on a single consumer GPU, where long context would otherwise cause out-of-memory with Full Attention.
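The scoring idea in the abstract can be made concrete with a short sketch. This is an illustrative reading, not the paper's implementation: the calibration centers mu_q/mu_k, the interleaved RoPE pairing, the log-norm term, and the weight alpha are all assumptions made here for the example.

```python
import numpy as np

def rope_freqs(head_dim, base=10000.0):
    """Standard RoPE inverse frequencies, one per 2-D sub-band."""
    return base ** (-np.arange(0, head_dim, 2) / head_dim)

def trig_distance_score(mu_q, mu_k, distances, base=10000.0):
    """Expected attention logit at each query-key distance, S_trig(d).

    mu_q, mu_k: (head_dim,) pre-RoPE Q/K concentration centers obtained
                from offline calibration (hypothetical inputs).
    distances:  (num_keys,) non-negative relative distances d = m - n.

    Under RoPE, each consecutive (even, odd) dimension pair is a 2-D
    sub-band rotated by d * theta_i, so the logit between the centers is
    a trigonometric series in d:
        sum_i |mu_q_i| * |mu_k_i| * cos(d * theta_i + phi_q_i - phi_k_i).
    """
    head_dim = mu_q.shape[0]
    freqs = rope_freqs(head_dim, base)            # (head_dim/2,)
    q2, k2 = mu_q.reshape(-1, 2), mu_k.reshape(-1, 2)
    amp = np.linalg.norm(q2, axis=1) * np.linalg.norm(k2, axis=1)
    phase = np.arctan2(q2[:, 1], q2[:, 0]) - np.arctan2(k2[:, 1], k2[:, 0])
    angles = np.outer(distances, freqs) + phase   # (num_keys, bands)
    return (amp * np.cos(angles)).sum(axis=1)     # (num_keys,)

def key_importance(mu_q, mu_k, key_norms, distances, alpha=1.0):
    """Combine the distance-preference score with a norm-based score
    (alpha and the log-norm form are assumptions, not the paper's rule)."""
    return trig_distance_score(mu_q, mu_k, distances) + alpha * np.log(key_norms + 1e-8)
```

Given per-head centers, these per-key scores can then be ranked and the lowest-scoring KV entries evicted to meet a cache budget.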



Figures (8)

  • Figure 1: Performance trade-offs on AIME25 (Qwen3-8B). (A) At equivalent accuracy (40.8%), TriAttention achieves 2.5$\times$ higher throughput than Full Attention. (B) TriAttention reduces KV cache memory by 10.7$\times$ while matching Full Attention accuracy.
  • Figure 2: Q/K concentration and its implications for attention. (A) Pre-RoPE Q/K vectors at the dominant frequency band are highly concentrated (high Mean Resultant Length $R$; see the sketch after this figure list). (B) RoPE rotation disperses these vectors into arc patterns. In (A-B), three distinct input sequences are overlaid, showing that this structure is stable across content. (C) This concentration holds across nearly all heads. (D) When Q/K are concentrated, attention logits can be accurately reconstructed using a trigonometric series (Pearson $r = 0.72$).
  • Figure 3: Attention reconstruction correlation across three DeepSeek-R1 distilled LLMs: Qwen3 [qwen2025qwen3], Qwen2.5 [qwen2024qwen25], and Llama3 [dubey2024llama3]. Distribution of per-head reconstruction Pearson correlation ($\bar{r}$) across all attention heads. The red dashed line indicates the mean. All models show right-skewed distributions with means above 0.5.
  • Figure 4: Method overview. From left to right: offline calibration computes Q distribution centers; during inference, keys are scored by combining $S_{\text{trig}}$ with a norm-based component; the rightmost panel shows the attention map after pruning. We observe that some heads exhibit distance preference: distant keys tend to receive higher attention. However, certain keys, despite being far from the query, receive little attention due to their low norms. This motivates our two scoring components: $S_{\text{trig}}$ captures distance preference, while the norm-based score identifies low-norm keys. In this example, $S_{\text{trig}}$ correctly assigns low scores to nearby keys, while the norm-based score identifies the earliest token (leftmost) as unimportant due to its low norm, despite its maximal distance. Together, they accurately identify tokens that will not be attended to and prune them. See Appendix \ref{app:method-visualization} for visualizations with real attention maps.
  • Figure 5: Performance comparison on Qwen3-8B. (A--C) Accuracy vs. KV cache budget on three mathematical reasoning benchmarks. TriAttention consistently outperforms R-KV across all budget levels. (D) Memory retention on the Recursive State Query benchmark. Depth refers to DFS recursion depth; deeper recursion requires retaining more intermediate states, increasing memory pressure.
  • ...and 3 more figures
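Figure 2's concentration metric, the Mean Resultant Length $R$, has a standard definition in circular statistics. Below is a minimal sketch of that conventional computation for one frequency band; the function name and input layout are illustrative, not taken from the paper's code.

```python
import numpy as np

def mean_resultant_length(band_vecs):
    """Mean Resultant Length R of 2-D sub-band vectors (circular statistics).

    band_vecs: (num_tokens, 2) pre-RoPE Q or K components of one
               frequency band, collected across tokens.
    R near 1 means directions cluster tightly around one center;
    R near 0 means they are spread uniformly around the circle.
    """
    angles = np.arctan2(band_vecs[:, 1], band_vecs[:, 0])
    return np.abs(np.exp(1j * angles).mean())
```

A high $R$ at the dominant band is what justifies replacing per-token Q/K vectors with fixed centers in the trigonometric score.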