LongFlow: Efficient KV Cache Compression for Reasoning M

Yi Su; Zhenxu Tian; Dan Qiao; Yuechi Zhou; Juntao Li; Min Zhang

Abstract

Recent reasoning models such as OpenAI-o1 and DeepSeek-R1 have shown strong performance on complex tasks including mathematical reasoning and code generation. However, this performance gain comes with substantially longer output sequences, which significantly increase deployment costs. In particular, long outputs require large KV caches, resulting in high memory consumption and severe bandwidth pressure during attention computation. Most existing KV cache optimization methods are designed for long-input, short-output scenarios and are ineffective in the long-output setting of reasoning models. Moreover, importance estimation in prior work is computationally expensive and becomes prohibitive when importance must be re-evaluated continuously during long generation. To address these challenges, we propose LongFlow, a KV cache compression method whose importance metric is derived from an intermediate result of the attention computation and uses only the current query. This design introduces negligible computational overhead and requires no auxiliary storage. We further develop a custom kernel that fuses FlashAttention, importance estimation, and token eviction into a single optimized operator, improving system-level efficiency. Experiments show that LongFlow achieves up to an 11.8× throughput improvement at 80% KV cache compression, with minimal impact on model accuracy.

Paper Structure (26 sections, 1 theorem, 23 equations, 5 figures, 1 table, 1 algorithm)

Introduction
Preliminary
Background: Attention, KV cache, and Hardware-Aware Optimization
Revisiting KV cache Compression in the Era of Long Reasoning Models
Method
A Lightweight Design Philosophy
Derivation of the Importance Metric
Theoretical Justification of the Approximations
High-Performance Implementation
Experiments
Experimental Setup
Main Results on Model Accuracy
Performance on Different Model Sizes
Throughput and Memory Analysis
Related Work
...and 11 more sections

Key Result

Lemma 1

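The softmax function $\sigma(\mathbf{s})_i = \exp(s^i)/\sum_j \exp(s^j)$ is 1-Lipschitz continuous with respect to the $\ell_1$-norm on its output and the $\ell_\infty$-norm on its input; that is, $\|\sigma(\mathbf{s}) - \sigma(\mathbf{s}')\|_1 \le \|\mathbf{s} - \mathbf{s}'\|_\infty$ for all score vectors $\mathbf{s}, \mathbf{s}'$.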
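
The bound in Lemma 1 can be checked numerically. The following is a minimal sketch, not taken from the paper: it samples random score vectors and perturbations, and reports the largest observed ratio between the L1 change in the softmax output and the L-infinity change in the input. The helper names (`softmax`, `max_lipschitz_ratio`) and the sampling ranges are illustrative assumptions.

```python
import math
import random


def softmax(s):
    """Numerically stable softmax over a list of floats."""
    m = max(s)
    exps = [math.exp(x - m) for x in s]
    total = sum(exps)
    return [e / total for e in exps]


def l1_distance(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))


def linf_distance(a, b):
    return max(abs(x - y) for x, y in zip(a, b))


def max_lipschitz_ratio(trials=2000, dim=16, seed=0):
    """Largest observed ||softmax(s) - softmax(t)||_1 / ||s - t||_inf.

    Lemma 1 states that this ratio never exceeds 1. The sampling ranges
    below are arbitrary choices made for this illustration.
    """
    rng = random.Random(seed)
    worst = 0.0
    for _ in range(trials):
        s = [rng.uniform(-5.0, 5.0) for _ in range(dim)]
        scale = rng.choice([1e-3, 1e-1, 1.0])  # small and large perturbations
        t = [x + rng.uniform(-scale, scale) for x in s]
        denom = linf_distance(s, t)
        if denom > 0.0:
            ratio = l1_distance(softmax(s), softmax(t)) / denom
            worst = max(worst, ratio)
    return worst


if __name__ == "__main__":
    print("largest observed ratio:", max_lipschitz_ratio())
```

A random search of this kind can only fail to find a counterexample; it does not replace the proof, but it is a quick sanity check when the score vectors are perturbed by an approximation.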

Figures (5)

Figure 1: Attention module latency of H2O and our kernel on Qwen3-8B with batch size 128 and sequence length 3200. Both methods evict one token after each attention computation.
Figure 2: The data and computation flow of our method. O: attention output; I: LongFlowScore; S, P, and G are intermediate states in the kernel forward pass. (Left): A single decoding step. The current KV overwrites the slot selected in the previous step; the static KV and Mask are then passed to the kernel together with Q to produce the current attention output and the slot to be overwritten in the next step. (Middle): The data flow between HBM and SRAM in the kernel. KV and Mask are loaded into SRAM block by block for the fused attention computation. (Right): The on-chip computational flow. Unlike standard FlashAttention, we split the matrix multiplication of P and V into two steps and derive LongFlowScore from the intermediate result G.
Figure 3: Performance comparison of LongFlow against baselines. The compression is conducted every step for LongFlow and every 128 steps for other methods. LongFlow achieves higher throughput and supports a larger maximum batch size due to superior memory management.
Figure 4: Accuracy of LongFlow and the baselines across different model sizes on different datasets.
Figure 5: Empirical motivation for our single-query hypothesis.
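
The decoding step described in the Figure 2 caption can be sketched in plain Python for a single attention head. This is an unfused reference illustration, not the authors' kernel. Two points are assumptions made here because this section does not define them: the score I is taken to be the L1 norm of each row of G (the per-slot contribution P_i · V_i), and an empty slot, if one exists, is filled before any token is evicted. The function name `decode_step` is hypothetical.

```python
import math


def decode_step(q, keys, values, valid):
    """One decoding step over a fixed-size slot buffer (single head).

    q      : query vector for the current token
    keys   : one key vector per slot
    values : one value vector per slot
    valid  : mask, True where the slot holds a live token (at least one True)

    Returns (O, slot): the attention output and the slot index that the
    next token's KV should overwrite.
    """
    d = len(q)
    # S: scaled dot-product scores; masked slots get -inf so they receive
    # zero attention weight.
    S = [
        sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) if ok else -math.inf
        for k, ok in zip(keys, valid)
    ]
    m = max(S)
    e = [math.exp(s - m) for s in S]  # exp(-inf) evaluates to 0.0
    Z = sum(e)
    P = [x / Z for x in e]

    # G: the P @ V product split into two steps. First scale each value row
    # by its attention weight, then reduce over slots to obtain O.
    G = [[p * vj for vj in v] for p, v in zip(P, values)]
    O = [sum(col) for col in zip(*G)]

    # I: per-slot importance. ASSUMPTION: L1 norm of each row of G.
    I = [sum(abs(g) for g in row) for row in G]

    # ASSUMPTION: reuse an empty slot first; otherwise overwrite the slot
    # with the smallest score.
    empty = [i for i, ok in enumerate(valid) if not ok]
    slot = empty[0] if empty else min(range(len(I)), key=I.__getitem__)
    return O, slot


if __name__ == "__main__":
    q = [1.0, 0.0]
    keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
    values = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
    O, slot = decode_step(q, keys, values, [True, True, True])
    print("output:", O, "next slot to overwrite:", slot)
```

Because the score reuses G, which is already needed to form O, it requires no extra pass over the cache and no stored history of past attention weights, which matches the "only the current query" and "no auxiliary storage" properties claimed in the abstract.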

Theorems & Definitions (2)

Lemma 1: Lipschitz Property of Softmax
proof
