FlowKV: Enhancing Multi-Turn Conversational Coherence in LLMs via Isolated Key-Value Cache Management

Xiang Liu, Hong Chen, Xuming Hu, Xiaowen Chu

TL;DR

FlowKV tackles the KV Cache bottleneck in multi-turn LLMs by introducing a training-free multi-turn isolation mechanism that prevents re-compression of older context. It is compatible with any KV cache compression method: it preserves the accumulated, already-compressed history and compresses only the latest turn's KV pairs. A formal analysis shows that traditional nested compression causes exponential signal decay of early context, whereas FlowKV compresses each turn only once and so avoids this compounding loss, which keeps long-range dependencies robust. Empirically, FlowKV delivers substantial improvements in instruction following and user preference retention on the Multi-IF and PrefEval benchmarks, with LLaMA-3.1-8B-Instruct and Qwen-2.5-7B-Instruct as base models. Gains exceed 20% IFR on average and reach up to 64.5 percentage points on PrefEval, while incurring minimal overhead. The approach offers a practical, training-free solution for scalable, coherent multi-turn dialogue systems.
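
The decay claim can be made concrete with a deliberately simplified model. This is an illustrative assumption, not the paper's formal analysis: suppose every compression pass retains a fixed fraction $r \in (0, 1)$ of the signal it is applied to, and let $T$ be the number of turns completed since the first turn entered the cache. The symbols $r$, $T$, and $s$ are introduced here for illustration only.

```latex
% Assumed toy model: each compression pass keeps a fraction r of the signal.
\begin{align*}
  s_{\text{nested}}(T)   &= r^{T}
    && \text{(turn-1 context is re-compressed at every one of the $T$ turns)} \\
  s_{\text{isolated}}(T) &= r
    && \text{(turn-1 context is compressed once, then left untouched)} \\[4pt]
  \text{Example, } r = 0.5,\ T = 5:\quad
  s_{\text{nested}}      &= 0.5^{5} = 0.03125,
  \qquad s_{\text{isolated}} = 0.5
\end{align*}
```

Under this assumption the nested strategy loses early context geometrically in the number of turns, while the isolated strategy pays the compression cost once per turn.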

Abstract

Large Language Models (LLMs) are increasingly deployed in multi-turn conversational applications, where management of the Key-Value (KV) Cache is a significant bottleneck. The KV Cache grows linearly with dialogue history, which imposes substantial computational costs, and existing eviction strategies often degrade performance by repeatedly compressing early conversational context, leading to information loss and context forgetting. This paper introduces FlowKV, a novel multi-turn isolation mechanism for KV Cache management that can be applied to any KV Cache compression method without training. FlowKV preserves the accumulated compressed KV cache from past turns and applies compression only to the newly generated KV pairs of the latest completed turn. This prevents re-compression of older context and thereby mitigates catastrophic forgetting. Our results demonstrate that FlowKV consistently and significantly outperforms baseline strategies in maintaining instruction-following accuracy, and improves user preference retention from 10.90% to 75.40%, particularly in later conversational turns.
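
The difference between re-compressing the whole history and isolating past turns can be sketched with a toy cache. This is a minimal illustration under stated assumptions, not the authors' implementation: tokens are stand-in tuples rather than real key/value tensors, importance scores are fixed random numbers rather than attention-derived, and every name below (`compress`, `nested_eviction`, `isolated_compression`, and so on) is hypothetical. Each compression call keeps a fixed fraction of whatever it is given, so the two strategies do not end up with matched cache budgets, and the paper's exact budget accounting is not reproduced.

```python
import random
from typing import List, Tuple

# Toy stand-in for one cached KV pair: (turn index, position, importance score).
Token = Tuple[int, int, float]


def compress(tokens: List[Token], keep_ratio: float) -> List[Token]:
    """Keep the top-scoring fraction of tokens, restoring original order."""
    if not tokens:
        return []
    k = max(1, int(len(tokens) * keep_ratio))
    top = sorted(tokens, key=lambda t: t[2], reverse=True)[:k]
    return sorted(top, key=lambda t: (t[0], t[1]))


def make_turn(turn: int, n: int, rng: random.Random) -> List[Token]:
    """Create n toy tokens for one turn with random (assumed) scores."""
    return [(turn, pos, rng.random()) for pos in range(n)]


def nested_eviction(turns: List[List[Token]], keep_ratio: float) -> List[Token]:
    """Baseline: after each turn, compress the whole cache, old and new alike."""
    cache: List[Token] = []
    for turn_tokens in turns:
        cache = compress(cache + turn_tokens, keep_ratio)
    return cache


def isolated_compression(turns: List[List[Token]], keep_ratio: float) -> List[Token]:
    """Isolation: compress only the newest turn; earlier entries are untouched."""
    cache: List[Token] = []
    for turn_tokens in turns:
        cache = cache + compress(turn_tokens, keep_ratio)
    return cache


def survivors_per_turn(cache: List[Token], num_turns: int) -> List[int]:
    """Count how many tokens from each turn are still in the cache."""
    counts = [0] * num_turns
    for turn, _, _ in cache:
        counts[turn] += 1
    return counts


if __name__ == "__main__":
    rng = random.Random(0)
    turns = [make_turn(t, 100, rng) for t in range(3)]
    print("nested  :", survivors_per_turn(nested_eviction(turns, 0.5), 3))
    print("isolated:", survivors_per_turn(isolated_compression(turns, 0.5), 3))
```

With 100 tokens per turn and a keep ratio of 0.5, the isolated strategy retains exactly 50 tokens from every turn, because each turn passes through the compressor once. Under the nested strategy the first turn's survivors can only shrink from that initial 50, since they are put back into competition at every later turn.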

Paper Structure

This paper contains 41 sections, 17 equations, 10 figures, 11 tables.

Figures (10)

  • Figure 1: KV cache compression on PrefEval. (#SKV: SnapKV, #EA: ExpectedAttention).
  • Figure 2: Attention heatmap illustrating token-level focus during a 3-turn conversational interaction. The Y-axis represents query token positions, while the X-axis shows key token positions. Key observations (highlighted on the right) include: (1) Responses (e.g., Turn 1 Response) heavily focus on their corresponding query (e.g., T1Q) and the local context (local window). (2) As the dialogue progresses, responses (e.g., Turn 2 and 3 Responses) attend to an increasing span of historical context, including previous queries and responses (e.g., T1Q, T1R, T2Q, T2R). (3) Later queries (Turn 2 & 3 Queries) exhibit increased attention to both previous queries and the initial system prompt, indicating an evolving contextual understanding. These patterns underscore the complex, long-range dependencies managed by the attention mechanism in multi-turn dialogues.
  • Figure 3: Illustration of KV cache dynamics across three management strategies in a two-turn conversational setting. Each turn consists of a system prompt (Sys Prompt), user query (Query), and model response (Response), which contribute to the KV cache. Top Row (Full KV Cache): All KV states are retained, achieving high accuracy (60.72%) but leading to Out-Of-Memory (OOM) errors. Middle Row (KV Cache Eviction): Prompt-related KV cache is compressed each turn. This reduces cache size by 50% but results in a severe accuracy drop to 17.33%. Bottom Row (FlowKV - Ours): Our proposed FlowKV method also reduces cache size by 50% by selectively compressing only the current turn's prompt-related KV cache while preserving already compressed states from previous turns. This strategy effectively mitigates OOM issues while maintaining a high accuracy of 56.72%, significantly outperforming simple eviction.
  • Figure 4: LLaMA-3.1-8B-Instruct on Multi-IF with different compression methods and strategies.
  • Figure 5: LLaMA-3.1-8B-Instruct on PrefEval with different compression methods and strategies.
  • ...and 5 more figures