TVCACHE: A Stateful Tool-Value Cache for Post-Training LLM Agents

Abhishek Vijaya Kumar, Bhaskar Kataria, Byungsoo Oh, Emaad Manzoor, Rachee Singh

TL;DR

TVCache tackles the inefficiency of long tool executions during RL post-training of LLM agents by introducing a stateful tool-value cache built around a Tool Call Graph (TCG). It guarantees correctness through longest-prefix matching on tool-call trajectories and employs selective sandbox snapshotting and proactive sandbox forking to enable fast, concurrent reuse across rollouts. Across terminal-based, SQL, and video-understanding workloads, TVCache achieves up to $70\%$ cache-hit rates and up to $6.9\times$ reductions in median tool-call time, while preserving post-training reward trajectories. The system scales via cache sharding, asynchronous sandbox instantiation, and an open-source implementation suitable for integration with modern RL post-training frameworks, offering practical gains in efficiency and cost reductions for LLM agents with extensive tool-use capabilities.

Abstract

In RL post-training of LLM agents, calls to external tools take several seconds or even minutes, leaving allocated GPUs idle and inflating post-training time and cost. Many tool invocations repeat across parallel rollouts and could in principle be cached, but naively caching their outputs for reuse is incorrect because tool outputs depend on the environment state induced by prior agent interactions. We present TVCACHE, a stateful tool-value cache for LLM agent post-training. TVCACHE maintains a tree of observed tool-call sequences and performs longest-prefix matching for cache lookups: a hit occurs only when the agent's full tool history matches a previously executed sequence, guaranteeing identical environment state. On three diverse workloads (terminal-based tasks, SQL generation, and video understanding), TVCACHE achieves cache hit rates of up to 70% and reduces median tool-call execution time by up to 6.9×, with no degradation in post-training reward accumulation.
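
The lookup rule described in the abstract can be illustrated with a small prefix tree keyed on tool-call histories. The following is a minimal sketch under stated assumptions, not the TVCACHE implementation: the names `PrefixToolCache`, `Node`, `ToolCall`, and `toy_execute` are hypothetical, tool calls are encoded as `(tool, argument)` tuples, and the `execute` callback stands in for replaying a history in a fresh sandbox. Snapshotting and forking are not modeled.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical encoding of a tool call: (tool name, serialized argument).
ToolCall = Tuple[str, str]


@dataclass
class Node:
    """One node per distinct tool-call prefix; `output` caches the last call's result."""
    output: Optional[str] = None
    children: Dict[ToolCall, "Node"] = field(default_factory=dict)


class PrefixToolCache:
    """Toy stateful cache: a hit requires the *entire* tool history to match."""

    def __init__(self) -> None:
        self.root = Node()
        self.hits = 0
        self.misses = 0

    def longest_prefix(self, history: List[ToolCall]) -> Tuple[Node, int]:
        """Walk the tree and return the deepest matching node and its depth."""
        node, depth = self.root, 0
        for call in history:
            child = node.children.get(call)
            if child is None:
                break
            node, depth = child, depth + 1
        return node, depth

    def run(self, history: List[ToolCall],
            execute: Callable[[List[ToolCall]], str]) -> str:
        """Return the output of the last call in `history`, reusing it when safe."""
        if not history:
            raise ValueError("history must contain at least one tool call")
        node, depth = self.longest_prefix(history)
        if depth == len(history) and node.output is not None:
            self.hits += 1           # full history seen before: same state, same output
            return node.output
        self.misses += 1
        for call in history[depth:]:  # extend the tree with the unseen suffix
            node = node.children.setdefault(call, Node())
        node.output = execute(history)
        return node.output


def toy_execute(history: List[ToolCall]) -> str:
    """Stand-in sandbox (assumption): replays writes/reads on an in-memory file map."""
    files: Dict[str, str] = {}
    out = ""
    for tool, arg in history:
        if tool == "write":
            name, _, content = arg.partition("=")
            files[name] = content
            out = "ok"
        elif tool == "read":
            out = files.get(arg, "<missing>")
    return out


cache = PrefixToolCache()
a = [("write", "f=1"), ("read", "f")]
b = [("write", "f=2"), ("read", "f")]
cache.run(a, toy_execute)  # miss: nothing cached yet
cache.run(b, toy_execute)  # miss: same final call, but a different history
cache.run(a, toy_execute)  # hit: identical full history
```

In this toy setup the two rollouts end with the same `read` call but must not share a cached value, because the preceding `write` left the sandbox in a different state. Keying on the full history is what keeps the reuse correct.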

Paper Structure (29 sections, 2 equations, 15 figures, 3 tables)

Figures (15)

  • Figure 1: Illustration of 4 rollouts from an RL post-training iteration over time. The tokens generated in each rollout interleave between reasoning (green) and tool-calling (orange). Tools are executed in a sandbox environment and may mutate the sandbox state. In practice, tool execution comprises a significant portion of post-training time. TVCache eliminates redundant tool executions and speeds up post-training by caching tool outputs keyed on both the tool call and a graph-based representation of the sandbox state.
  • Figure 2: Wall-clock time taken by each rollout (sorted by total wall-clock time) for reasoning token generation and tool call execution in the (a) terminal-bench, (b) SkyRL-SQL, and (c) EgoSchema post-training workloads (see the datasets table, labeled tab:datasets, for workload details).
  • Figure 3: Tool Call Graph (TCG) $\mathcal{G}(p)$ constructed from 4 rollouts.
  • Figure 4: TVCache's architecture. The agent generating the rollout interacts with TVCache through ToolCallExecutor and ToolCallEnvironment in the pip-installable tvclient library.
  • Figure 5: Cache hit rates over post-training epochs for three workloads. TVCache exhibits high hit rates which increase over post-training epochs due to the tool call graph growing and branching further. Hit rates in the terminal-bench workload range from 15% to 32%. Hit rates in the SkyRL-SQL workload range from 27.0% to 57.2%. Hit rates in the EgoSchema workload range from 34% to 73.9%.
  • ...and 10 more figures
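
The Figure 5 caption attributes rising hit rates to the tool call graph growing across epochs. A small sketch can make that mechanism concrete, assuming a call counts as a hit when its full prefix is already in the tree. The function name `hit_rate_per_epoch` and the toy rollouts are hypothetical and do not come from the paper's workloads.

```python
from typing import Dict, Hashable, List, Sequence


def hit_rate_per_epoch(
    epochs: Sequence[Sequence[Sequence[Hashable]]],
) -> List[float]:
    """Per epoch, the fraction of tool calls whose full prefix was already in the tree.

    `epochs` is a list of epochs; each epoch is a list of rollouts; each rollout
    is an ordered list of tool calls. The tree persists across epochs.
    """
    root: Dict = {}  # nested dicts: one level per tool call in a trajectory
    rates: List[float] = []
    for rollouts in epochs:
        hits = total = 0
        for rollout in rollouts:
            node = root
            for call in rollout:
                total += 1
                if call in node:
                    hits += 1        # this exact prefix was executed before
                else:
                    node[call] = {}  # first time this prefix is seen: grow the tree
                node = node[call]
        rates.append(hits / total if total else 0.0)
    return rates


# Toy data (assumption): two epochs of two rollouts each.
toy_epochs = [
    [["ls", "cat a"], ["ls", "cat b"]],
    [["ls", "cat a"], ["ls", "cat b", "rm b"]],
]
rates = hit_rate_per_epoch(toy_epochs)
```

On this toy input the second epoch reuses most prefixes recorded in the first, so its hit rate is higher. Real hit rates depend on how much the policy's tool-call trajectories overlap.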