
SpecEyes: Accelerating Agentic Multimodal LLMs via Speculative Perception and Planning

Haoyu Huang, Jinfa Huang, Zhongwei Wan, Xiawu Zheng, Rongrong Ji, Jiebo Luo

Abstract

Agentic multimodal large language models (MLLMs) (e.g., OpenAI o3 and Gemini Agentic Vision) achieve remarkable reasoning capabilities through iterative visual tool invocation. However, the cascaded perception, reasoning, and tool-calling loops introduce significant sequential overhead. This overhead, termed agentic depth, incurs prohibitive latency and seriously limits system-level concurrency. To this end, we propose SpecEyes, an agentic-level speculative acceleration framework that breaks this sequential bottleneck. Our key insight is that a lightweight, tool-free MLLM can serve as a speculative planner to predict the execution trajectory, enabling early termination of expensive tool chains without sacrificing accuracy. To regulate this speculative planning, we introduce a cognitive gating mechanism based on answer separability, which quantifies the model's confidence for self-verification without requiring oracle labels. Furthermore, we design a heterogeneous parallel funnel that exploits the stateless concurrency of the small model to mask the stateful serial execution of the large model, maximizing system throughput. Extensive experiments on V* Bench, HR-Bench, and POPE demonstrate that SpecEyes achieves 1.1-3.35x speedup over the agentic baseline while preserving or even improving accuracy (up to +6.7%), thereby boosting serving throughput under concurrent workloads.

Paper Structure (13 sections, 1 theorem, 16 equations, 6 figures, 1 table)

This paper contains 13 sections, 1 theorem, 16 equations, 6 figures, 1 table.

Key Result

Proposition 1

Let $\hat{y}_S = (y_1, \ldots, y_{|\hat{y}_S|})$ be the speculative answer. Define the answer-level error event $\mathcal{E} = \bigcup_{n} \mathcal{E}_n$, where $\mathcal{E}_n$ denotes the event that token $y_n$ is incorrect. Then:

Figures (6)

  • Figure 1: Motivation and overview of SpecEyes. Top: Agentic MLLMs evaluate each query via a Markovian sequence of stateful tool invocations of depth $D$. This strict causal dependency prohibits parallelization, imposing a serving complexity of $\mathcal{O}(BDC)$ for $B$ queries, where $C$ denotes the per-step tool inference cost. Bottom: SpecEyes enables agentic-level speculative bypass with a stateless small model and an answer-separability gate. Here, $\beta$ is the fraction of tool-free candidates after screening (\ref{sec:parallel}) and $\alpha$ is the acceptance rate of speculative answers among them (\ref{sec:speceyes}, \ref{sec:gating}), averaging 80% and 71% across all benchmarks, respectively. All reported accuracy and speedup values are averaged across V* Bench, HR-Bench, and POPE.
  • Figure 2: Pipeline overview of SpecEyes. A batch of $B$ queries passes through a four-phase funnel. I: $\mathcal{M}_L$ screens tool necessity, splitting queries into tool-free and tool-required. II: A stateless $\mathcal{M}_S$ speculatively answers all tool-free queries with token-level logits. III: An answer separability score $S_{\text{sep}}$ gates each answer; those above $\tau$ are accepted directly. IV: Remaining queries fall back to the full agentic loop. The funnel yields $\approx 1/(1-\beta\alpha)\times$ throughput speedup.
  • Figure 3: KDE of confidence scores for correct vs. incorrect samples on V* (Qwen3-VL-2B). $\Delta$ measures gating discriminability via peak distance. Compared to the noticeable overlap in baselines (a, b, d), our $S_\text{sep}^\text{min}$ (c) achieves the largest $\Delta$ with sharp bimodal separation, enabling an optimal accuracy-speed trade-off.
  • Figure 4: Ablation on the gating threshold of SpecEyes. Lowering the threshold increases speedup at the cost of accuracy. Dashed horizontal lines indicate baseline accuracy.
  • Figure 5: Ablation on serving batch size. Larger batches amortize the stateless speculative stage, improving speedup with diminishing marginal gains as the stateful agentic fallback becomes the bottleneck. Curves report end-to-end speedup over the serial agentic baseline (1.0$\times$).
  • ...and 1 more figure
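
The four-phase funnel in the Figure 2 caption can be sketched as a small routing function. This is a minimal illustration, not the authors' implementation: `needs_tool`, `speculate`, and `agentic_loop` are hypothetical callables standing in for the $\mathcal{M}_L$ screening step, the $\mathcal{M}_S$ answer with its separability score, and the full agentic fallback. `estimated_speedup` simply evaluates the $1/(1-\beta\alpha)$ approximation from the caption.

```python
from typing import Callable, List, Optional, Tuple


def estimated_speedup(beta: float, alpha: float) -> float:
    """Evaluate the approximate speedup 1 / (1 - beta * alpha) from the Figure 2 caption."""
    if not (0.0 <= beta <= 1.0 and 0.0 <= alpha <= 1.0):
        raise ValueError("beta and alpha must lie in [0, 1]")
    bypass = beta * alpha  # fraction of queries answered without the agentic loop
    return float("inf") if bypass >= 1.0 else 1.0 / (1.0 - bypass)


def run_funnel(
    queries: List[str],
    needs_tool: Callable[[str], bool],              # hypothetical phase I screen
    speculate: Callable[[str], Tuple[str, float]],  # hypothetical phase II: (answer, score)
    agentic_loop: Callable[[str], str],             # hypothetical phase IV fallback
    tau: float,
) -> Tuple[List[Optional[str]], float, float]:
    """Route a batch through the funnel; return answers plus measured beta and alpha."""
    answers: List[Optional[str]] = [None] * len(queries)
    tool_free: List[int] = []
    fallback: List[int] = []

    # Phase I: split queries into tool-free and tool-required.
    for i, query in enumerate(queries):
        (fallback if needs_tool(query) else tool_free).append(i)

    # Phases II and III: speculate on tool-free queries, accept scores above tau.
    accepted = 0
    for i in tool_free:
        answer, score = speculate(queries[i])
        if score > tau:
            answers[i] = answer
            accepted += 1
        else:
            fallback.append(i)

    # Phase IV: everything else goes through the full agentic loop.
    for i in fallback:
        answers[i] = agentic_loop(queries[i])

    beta = len(tool_free) / len(queries) if queries else 0.0
    alpha = accepted / len(tool_free) if tool_free else 0.0
    return answers, beta, alpha
```

Plugging the averages quoted in the Figure 1 caption into the formula, `estimated_speedup(0.80, 0.71)` evaluates to about 2.3. This is an idealized figure: the formula itself ignores the cost of the screening and speculative stages.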

Theorems & Definitions (1)

  • Proposition 1
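
Proposition 1 treats the answer-level error as the union of token-level errors, and it is a standard fact that the probability of a union is at least that of its most likely member. One way to read the $S_\text{sep}^\text{min}$ score in Figure 3 is therefore as a minimum over per-token scores. The sketch below illustrates that aggregation under loud assumptions: the definition of $S_\text{sep}$ is not reproduced in this summary, so the per-token score used here (the margin between the top-1 and top-2 logits) is an assumed stand-in, and all function names are hypothetical.

```python
from typing import List, Sequence


def token_margin(logits: Sequence[float]) -> float:
    """Assumed per-token score: gap between the largest and second-largest logit."""
    if len(logits) < 2:
        raise ValueError("need at least two logits per token")
    top1, top2 = sorted(logits, reverse=True)[:2]
    return top1 - top2


def score_min(token_logits: List[Sequence[float]]) -> float:
    """Aggregate by the weakest token, so one ambiguous token lowers the whole answer."""
    if not token_logits:
        raise ValueError("answer has no tokens")
    return min(token_margin(logits) for logits in token_logits)


def score_mean(token_logits: List[Sequence[float]]) -> float:
    """Aggregate by the average token, which can hide a single ambiguous token."""
    if not token_logits:
        raise ValueError("answer has no tokens")
    return sum(token_margin(logits) for logits in token_logits) / len(token_logits)


# Three answer tokens; the second one is nearly a tie between two candidates.
example = [[5.0, 1.0, 0.0], [2.0, 1.9, 0.0], [4.0, 0.0, -1.0]]
tau = 1.0  # illustrative threshold, not a value from the paper
accept_by_min = score_min(example) > tau
accept_by_mean = score_mean(example) > tau
```

In this toy case the mean margin clears the threshold while the minimum does not, so a min-aggregated gate would send the answer to the agentic fallback.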