Speculative Actions: A Lossless Framework for Faster Agentic Systems

Naimeng Ye, Arnav Ahuja, Georgios Liargkovas, Yunan Lu, Kostis Kaffes, Tianyi Peng

TL;DR

The paper addresses the latency bottleneck in agent-environment loops caused by strictly sequential API calls. It introduces Speculative Actions, a lossless framework that pairs a fast Speculator with a slower, authoritative Actor: the Speculator guesses the Actor's next action so that the calls depending on it can be launched in parallel, and only verified guesses are kept, so correctness is not sacrificed. A formal analysis shows that, under the paper's assumptions, the end-to-end latency ratio converges to $1 - \frac{p}{1+p} \cdot \frac{\alpha}{\alpha+\beta}$ as $T \to \infty$, where $p$ is the probability that a speculation is correct. In the ideal limit ($p = 1$ and $\frac{\alpha}{\alpha+\beta} \to 1$) the ratio approaches $\frac{1}{2}$, that is, at most a 50% reduction in latency. The framework also supports multi-step and uncertainty-aware extensions. Empirically, the approach yields time savings across chess, e-commerce dialogues, multi-hop web search, and an OS-tuning scenario, with practical guidance on model selection and safety mechanics. The work offers a general design principle (opportunistic parallelism in environment interactions) that moves toward real-time, scalable agentic systems, and it suggests future directions in hierarchical speculator-actor designs and reinforcement-learning-informed speculation.
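To make the limiting latency ratio concrete, the short sketch below evaluates the expression from the TL;DR for a few parameter choices. The function name `limiting_latency_ratio` is illustrative and not from the paper. The reading of the symbols is partly an assumption: $p$ is the per-step probability of a correct speculation and $\alpha$ is the rate of the Speculator's exponential latency, as stated in Proposition 1, while $\beta$ is assumed here to be the rate of the Actor's call, since the proposition text on this page is truncated before it defines $\beta$.

```python
def limiting_latency_ratio(p: float, alpha: float, beta: float) -> float:
    """Evaluate 1 - p/(1+p) * alpha/(alpha+beta), the T -> infinity ratio.

    p     : probability that a speculation is correct (0 <= p <= 1)
    alpha : rate of the Speculator's latency (from Proposition 1)
    beta  : ASSUMED to be the rate of the Actor's latency
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    if alpha <= 0.0 or beta <= 0.0:
        raise ValueError("alpha and beta must be positive rates")
    return 1.0 - (p / (1.0 + p)) * (alpha / (alpha + beta))


if __name__ == "__main__":
    # Illustrative parameter choices only; none of these are results from the paper.
    for p, alpha, beta in [(0.0, 4.0, 1.0), (0.55, 4.0, 1.0), (1.0, 1.0, 1.0)]:
        ratio = limiting_latency_ratio(p, alpha, beta)
        print(f"p={p:.2f} alpha={alpha} beta={beta} -> ratio={ratio:.3f}")
```

With $p = 0$ the ratio is exactly 1 (no gain), and the value decreases toward $\frac{1}{2}$ as $p \to 1$ and $\alpha$ grows relative to $\beta$.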

Abstract

Despite growing interest in AI agents across industry and academia, their execution in an environment is often slow, hampering training, evaluation, and deployment. For example, a game of chess between two state-of-the-art agents may take hours. A critical bottleneck is that agent behavior unfolds sequentially: each action requires an API call, and these calls can be time-consuming. Inspired by speculative execution in microprocessors and speculative decoding in LLM inference, we propose speculative actions, a lossless framework for general agentic systems that predicts likely actions using faster models, enabling multiple steps to be executed in parallel. We evaluate this framework across three agentic environments (gaming, e-commerce, and web search) and a "lossy" extension for an operating systems environment. In all cases, speculative actions achieve substantial accuracy in next-action prediction (up to 55%), translating into significant reductions in end-to-end latency. Moreover, performance can be further improved through stronger guessing models, top-K action prediction, multi-step speculation, and uncertainty-aware optimization, opening a promising path toward deploying low-latency agentic systems in the real world.

Paper Structure

This paper contains 44 sections, 1 theorem, 19 equations, 9 figures, 2 tables, 1 algorithm.

Key Result

Proposition 1

Under the paper's assumptions (guessing accuracy through concurrent API calls), suppose at each step the speculative branch implies the correct next call $(h_{t+1}, q_{t+1})$ with probability $p$, independently across $t \in [1, T-1]$. Let the latency of $\hat{g}$ be $\mathrm{Exp}(\alpha)$ and the latency of the actual

Figures (9)

  • Figure 1: Illustration of our framework in a chess-playing environment. While the Actor issues an LLM call to decide the next move, the Speculator uses a faster model to guess it. These guesses enable parallel API calls for the next steps, and once a guess is verified, the system gains time through parallelization. The process runs in the backend, ensuring a lossless speedup for the user.
  • Figure 2: Percentage of time saved and percentage of correct predictions across $5$ runs at $30$ steps.
  • Figure 3: API prediction accuracy across Speculator models with varying reasoning capability.
  • Figure 4: Accuracy with gemini-2.5-flash as the Actor. Speculating multiple actions (k=3) yields higher accuracy than predicting a single action.
  • Figure 5: (Left) Comparison of Speculator-Actor, Speculator-only, and Actor-only convergence. The Speculator shortens time spent exploring poor settings. The Speculator-only agent stabilizes quickly but at a worse final value. (Right) Average p95 latency over a 20-second tuning experiment, showing that rapid reaction offers immediate performance benefits (see the paper's appendix on speculative mitigation). Lower is better.
  • ...and 4 more figures
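
The Speculator-Actor loop described in the Figure 1 caption can be sketched as follows. This is a toy illustration under stated assumptions, not the authors' implementation: `policy`, `actor`, `speculator`, `next_call`, and all delay constants are hypothetical stand-ins, the "state" is a plain integer, and the speculative call is assumed to be side-effect free so that a wrong guess can simply be discarded. The point is only that launching the next call on the Speculator's guess, then keeping it if and only if the Actor agrees, yields the same trajectory as the sequential loop.

```python
import asyncio

# Hypothetical latencies (seconds); chosen only to make the overlap visible.
ACTOR_DELAY = 0.05
SPECULATOR_DELAY = 0.01
NEXT_CALL_DELAY = 0.05


def policy(state: int) -> int:
    """Toy ground-truth decision rule that the Actor follows."""
    return (state * 7 + 3) % 5


async def actor(state: int) -> int:
    """Slow, authoritative model."""
    await asyncio.sleep(ACTOR_DELAY)
    return policy(state)


async def speculator(state: int) -> int:
    """Fast guesser; in this toy it is right only on even states."""
    await asyncio.sleep(SPECULATOR_DELAY)
    return policy(state) if state % 2 == 0 else policy(state) + 1


async def next_call(state: int, action: int) -> int:
    """Stand-in for the next API call that depends on the chosen action."""
    await asyncio.sleep(NEXT_CALL_DELAY)
    return state + action + 1


async def run_sequential(state: int, steps: int) -> int:
    for _ in range(steps):
        action = await actor(state)
        state = await next_call(state, action)
    return state


async def run_speculative(state: int, steps: int) -> tuple[int, int]:
    hits = 0
    for _ in range(steps):
        actor_task = asyncio.create_task(actor(state))  # authoritative call
        guess = await speculator(state)                 # fast guess arrives first
        spec_task = asyncio.create_task(next_call(state, guess))  # pre-launch
        action = await actor_task
        if action == guess:
            hits += 1
            state = await spec_task      # already in flight: time saved
        else:
            spec_task.cancel()           # discard the wrong branch
            state = await next_call(state, action)
    return state, hits


if __name__ == "__main__":
    print(asyncio.run(run_sequential(0, 6)))
    print(asyncio.run(run_speculative(0, 6)))
```

Because a mismatched guess is cancelled and the call is reissued with the Actor's action, the final state matches the sequential run regardless of how accurate the Speculator is; accuracy affects only how much waiting is overlapped.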

Theorems & Definitions (2)

  • Proposition 1
  • proof