SideQuest: Model-Driven KV Cache Management for Long-Horizon Agentic Reasoning

Sanjay Kariyappa, G. Edward Suh

TL;DR

SideQuest uses the Large Reasoning Model (LRM) itself to compress the KV cache by reasoning about the usefulness of the tokens in its context. It reduces peak token usage on agentic tasks with minimal degradation in accuracy and outperforms heuristic-based KV cache compression techniques.

Abstract

Long-running agentic tasks, such as deep research, require multi-hop reasoning over information distributed across multiple webpages and documents. In such tasks, the LLM context is dominated by tokens from external retrieval, causing memory usage to grow rapidly and limiting decode performance. While several KV cache compression techniques exist for long-context inputs, we find that existing heuristics fail to support multi-step reasoning models effectively. We address this challenge with SideQuest -- a novel approach that leverages the Large Reasoning Model (LRM) itself to perform KV cache compression by reasoning about the usefulness of tokens in its context. To prevent the tokens associated with this management process from polluting the model's memory, we frame KV cache compression as an auxiliary task executed in parallel to the main reasoning task. Our evaluations, using a model trained with just 215 samples, show that SideQuest reduces peak token usage by up to 65% on agentic tasks with minimal degradation in accuracy, outperforming heuristic-based KV cache compression techniques.
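
To make the memory argument concrete, the following is a rough back-of-the-envelope sketch of how KV cache size scales with context length in a standard transformer where every layer caches keys and values for all tokens. The function name and the model dimensions used below are hypothetical and chosen only for illustration; they are not the configuration of any model evaluated in the paper.

```python
def kv_cache_bytes(num_tokens, num_layers, num_kv_heads, head_dim, bytes_per_elem=2):
    """Approximate KV cache size for one sequence.

    Assumes full attention in every layer, so each layer stores one key and
    one value vector per KV head per token. All dimensions are hypothetical.
    """
    keys_and_values = 2  # one key tensor and one value tensor per layer
    per_token = keys_and_values * num_layers * num_kv_heads * head_dim * bytes_per_elem
    return per_token * num_tokens


# Hypothetical model: 24 layers, 8 KV heads, head dimension 64, 16-bit cache entries.
baseline_tokens = 100_000
compressed_tokens = baseline_tokens * 35 // 100  # a 65% cut in peak tokens, the best case reported

baseline_bytes = kv_cache_bytes(baseline_tokens, 24, 8, 64)
compressed_bytes = kv_cache_bytes(compressed_tokens, 24, 8, 64)
print(f"baseline:   {baseline_bytes / 1e9:.2f} GB")
print(f"compressed: {compressed_bytes / 1e9:.2f} GB")
```

Because the cache grows linearly with the number of tokens kept, a reduction in peak token usage translates directly into a proportional reduction in peak cache memory under these assumptions.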

Paper Structure

This paper contains 27 sections, 7 figures, 2 algorithms.

Figures (7)

  • Figure 1: Walkthrough example of SideQuest. (a) The main thread processes the user request by performing multi-turn reasoning and tool calling. (b) At regular intervals, we spawn an auxiliary thread that runs in parallel on the shared context. (c) The auxiliary thread reflects on the context and lists the cursors that can be deleted. (d) We clear the corresponding messages from the context by invoking a tool, reducing the context size for future turns.
  • Figure 2: Distribution of ReAct Iterations and token count for FRAMES and BrowseComp with gpt-oss-20b (medium effort).
  • Figure 3: Efficiency vs. Utility Trade-off. We evaluate Accuracy against Peak Token Usage and KV cache memory reads for gpt-oss-20b with Medium and High reasoning effort, on the FRAMES and BrowseComp benchmarks. The Uncompressed Baseline establishes the upper bound for accuracy but incurs the highest memory cost. SideQuest achieves substantial memory savings, reducing peak token usage by 56-65% compared to the baseline, while achieving higher accuracy than heuristic-based methods.
  • Figure 4: Non-Completion Rate across benchmarks, categorized by failure type: Unparsable Responses (orange), Context Limits (green), and Turn Limits (purple). SideQuest demonstrates superior reliability, matching the near-zero failure rate of the uncompressed baseline, while other methods suffer from high rates of model collapse.
  • Figure 5: Serving Performance in SGLang. We compare SideQuest against the uncompressed baseline for gpt-oss-20b (Medium Effort) on the FRAMES benchmark using a single NVIDIA H100 GPU. (Left) SideQuest increases peak throughput by $83.9\%$ by enabling larger batch sizes. (Center) Peak KV cache usage is reduced by $53.9\%$, freeing up significant memory headroom. (Right) The combination of higher concurrency and reduced memory movement lowers total benchmark runtime by $36.8\%$.
  • ...and 2 more figures
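
The control flow described in the Figure 1 caption can be sketched as a toy simulation. This is a minimal illustration under stated assumptions, not the paper's implementation: the `Context` class, `run_agent`, `noisy_tool`, and `keyword_policy` are hypothetical names; a keyword rule stands in for the LRM's reasoning about which cursors to delete; tokens are counted by whitespace splitting; and the auxiliary pass runs sequentially here, whereas the paper runs it in parallel with the main thread.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Message:
    cursor: int   # hypothetical stable id the auxiliary pass can refer to
    role: str
    content: str
    tokens: int


@dataclass
class Context:
    messages: List[Message] = field(default_factory=list)
    next_cursor: int = 0
    peak_tokens: int = 0

    def total_tokens(self) -> int:
        return sum(m.tokens for m in self.messages)

    def append(self, role: str, content: str) -> int:
        tokens = len(content.split())  # crude whitespace token count
        self.messages.append(Message(self.next_cursor, role, content, tokens))
        self.next_cursor += 1
        self.peak_tokens = max(self.peak_tokens, self.total_tokens())
        return self.next_cursor - 1

    def delete(self, cursors: List[int]) -> None:
        drop = set(cursors)
        self.messages = [m for m in self.messages if m.cursor not in drop]


def run_agent(question: str,
              tool: Callable[[int], str],
              select_deletable: Callable[[List[Message]], List[int]],
              num_turns: int,
              interval: int) -> Context:
    ctx = Context()
    ctx.append("user", question)
    for turn in range(1, num_turns + 1):
        # (a) Main thread: one tool call per turn; the result joins the context.
        ctx.append("tool", tool(turn))
        if turn % interval == 0:
            # (b, c) Auxiliary pass: inspects a snapshot of the shared context and
            # returns cursors to delete. Its own output is never appended to ctx.
            cursors = select_deletable(list(ctx.messages))
            # (d) Clear the selected messages so later turns see a smaller context.
            ctx.delete(cursors)
    return ctx


def noisy_tool(turn: int) -> str:
    # Stand-in retrieval: every third result is useful, the rest is filler.
    return "relevant fact" if turn % 3 == 0 else "noise " * 50


def keyword_policy(snapshot: List[Message]) -> List[int]:
    # Stand-in for the model's judgement: drop tool output without the keyword.
    return [m.cursor for m in snapshot
            if m.role == "tool" and "relevant" not in m.content]


def keep_everything(snapshot: List[Message]) -> List[int]:
    return []


if __name__ == "__main__":
    managed = run_agent("who wrote X", noisy_tool, keyword_policy, 6, 2)
    baseline = run_agent("who wrote X", noisy_tool, keep_everything, 6, 2)
    print(managed.peak_tokens, baseline.peak_tokens)
```

In this toy run, periodic deletion keeps the peak context well below the uncompressed baseline because filler tool output is cleared before it accumulates. The key design point mirrored from the caption is that the selection step reads the shared context but leaves no trace in it.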