Arbitrage: Efficient Reasoning via Advantage-Aware Speculation

Monishwaran Maheswaran, Rishabh Tiwari, Yuezhou Hu, Kerem Dilmen, Coleman Hooper, Haocheng Xi, Nicholas Lee, Mehrdad Farajtabar, Michael W. Mahoney, Kurt Keutzer, Amir Gholami

TL;DR

The paper tackles the inefficiency of reasoning-intensive inference in large language models by moving from token-level to step-level speculative decoding. It introduces Arbitrage, a two-part framework consisting of an Arbitrage Oracle (ideal, compares draft and target steps) and an Arbitrage Router (practical, predicts the target’s advantage using only draft context). By routing decisions based on expected improvement rather than absolute draft quality, Arbitrage substantially reduces wasted target compute while preserving or improving accuracy, achieving up to ~2x end-to-end speedups across math and olympiad-style benchmarks. Extensive ablations show that a simple 2-class classifier with step annotations and balanced downsampling yields robust, threshold-insensitive routing that tracks the oracle closely. The work establishes a practical baseline for efficient reasoning with LLMs and demonstrates significant practical latency reductions in real-world reasoning tasks.

Abstract

Modern Large Language Models achieve impressive reasoning capabilities with long Chains of Thought, but they incur substantial computational cost during inference, and this motivates techniques to improve the performance-cost ratio. Among these techniques, Speculative Decoding accelerates inference by employing a fast but inaccurate draft model to autoregressively propose tokens, which are then verified in parallel by a more capable target model. However, due to unnecessary rejections caused by token mismatches in semantically equivalent steps, traditional token-level Speculative Decoding struggles in reasoning tasks. Although recent works have shifted to step-level semantic verification, which improves efficiency by accepting or rejecting entire reasoning steps, existing step-level methods still regenerate many rejected steps with little improvement, wasting valuable target compute. To address this challenge, we propose Arbitrage, a novel step-level speculative generation framework that routes generation dynamically based on the relative advantage between draft and target models. Instead of applying a fixed acceptance threshold, Arbitrage uses a lightweight router trained to predict when the target model is likely to produce a meaningfully better step. This routing approximates an ideal Arbitrage Oracle that always chooses the higher-quality step, achieving near-optimal efficiency-accuracy trade-offs. Across multiple mathematical reasoning benchmarks, Arbitrage consistently surpasses prior step-level Speculative Decoding baselines, reducing inference latency by up to $\sim2\times$ at matched accuracy.
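
The step-level routing described in the abstract can be sketched as a short loop. This is a minimal illustration rather than the authors' implementation: the callables `draft_step`, `target_step`, and `router_score`, the `[END]` stop marker, and the toy stubs are all hypothetical names introduced here. The decision rule (keep the draft step when the router score is at most the threshold, otherwise regenerate with the target) follows the Figure 1 caption.

```python
from typing import Callable, List, Tuple


def arbitrage_generate(
    question: str,
    draft_step: Callable[[str], str],           # hypothetical: draft model proposes one step
    target_step: Callable[[str], str],          # hypothetical: target model regenerates one step
    router_score: Callable[[str, str], float],  # hypothetical: estimated P(target beats draft)
    tau: float,
    max_steps: int = 8,
) -> Tuple[List[str], int]:
    """Generate reasoning steps, escalating to the target only when y_hat > tau."""
    context = question
    steps: List[str] = []
    target_calls = 0
    for _ in range(max_steps):
        candidate = draft_step(context)
        y_hat = router_score(context, candidate)
        if y_hat > tau:
            # Router predicts a meaningful gain: regenerate this step with the target.
            candidate = target_step(context)
            target_calls += 1
        steps.append(candidate)
        context = context + "\n" + candidate
        if candidate.strip().endswith("[END]"):  # assumed stop marker
            break
    return steps, target_calls


# Toy stubs standing in for real models (assumptions for illustration only).
def toy_draft(ctx: str) -> str:
    n = ctx.count("\n")
    return f"draft step {n}" + (" [END]" if n == 2 else "")


def toy_target(ctx: str) -> str:
    n = ctx.count("\n")
    return f"target step {n}" + (" [END]" if n == 2 else "")


def toy_router(ctx: str, step: str) -> float:
    # Pretend the router expects the target to help only on step 1.
    return 0.9 if "step 1" in step else 0.1


steps, target_calls = arbitrage_generate("Q", toy_draft, toy_target, toy_router, tau=0.5)
```

With these stubs, only the second step is escalated. Raising `tau` toward 1 keeps more draft steps, and lowering it toward 0 sends more steps to the target, which is the compute-quality knob the paper describes.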

Paper Structure

This paper contains 25 sections, 2 theorems, 18 equations, 8 figures, 4 tables, 1 algorithm.

Key Result

Corollary B.2

Let $\mathcal{C}(a)\in\mathbb{R}_+$ denote computational cost (e.g., the number of target calls $=\sum_i a_i$). As $\tau$ varies, the threshold policies $a^\star_\tau$ trace a cost–quality curve $\bigl(\mathcal{C}(a^\star_\tau),\,\mathcal{S}(a^\star_\tau)\bigr)$ that majorizes any other policy's cost–quality pair; i.e., the oracle threshold family achieves the Pareto frontier of cost versus quality.
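
A small brute-force check can make the corollary concrete. The sketch below rests on assumptions that this summary does not state: each step has a draft score and a target score, the quality $\mathcal{S}(a)$ is the sum of the scores of the selected steps, and the oracle threshold policy escalates step $i$ exactly when the target's advantage over the draft exceeds $\tau$. The score values are made up for illustration.

```python
from itertools import product

# Hypothetical per-step scores (illustrative values, not from the paper).
draft_scores = [0.9, 0.4, 0.7, 0.2, 0.6]
target_scores = [0.8, 0.9, 0.75, 0.9, 0.6]


def cost(a):
    """C(a): number of target calls, i.e. sum_i a_i."""
    return sum(a)


def quality(a):
    """Assumed S(a): total score of the selected steps (target if a_i = 1, else draft)."""
    return sum(t if a_i else d for a_i, d, t in zip(a, draft_scores, target_scores))


def oracle_policy(tau):
    """Assumed a*_tau: escalate step i iff the target's advantage exceeds tau."""
    return tuple(int(t - d > tau) for d, t in zip(draft_scores, target_scores))


# Sweep tau to trace the cost-quality curve of the threshold family.
taus = [-1.0, -0.05, 0.02, 0.2, 0.6, 1.0]
curve = [(cost(oracle_policy(tau)), quality(oracle_policy(tau))) for tau in taus]
costs = [c for c, _ in curve]


def is_dominated(a, eps=1e-12):
    """True if some threshold policy is no more costly and at least as good as a."""
    return any(c <= cost(a) and q >= quality(a) - eps for c, q in curve)


# Enumerate all 2^5 escalation policies and check each against the curve.
all_policies = list(product([0, 1], repeat=len(draft_scores)))
frontier_holds = all(is_dominated(a) for a in all_policies)
```

Under these assumptions, every one of the 32 possible policies is matched or beaten by a threshold policy of equal or lower cost, because escalating the steps with the largest advantage first maximizes the total score at any fixed number of target calls.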

Figures (8)

  • Figure 1: Arbitrage overview. At each reasoning step, the draft proposes a candidate. The router produces a score $\hat{y}$, which is the estimated probability that the target will outperform the draft on this step, and accepts the draft if $\hat{y}\le\tau$, otherwise escalates to the target to regenerate ($\hat{y}>\tau$). The selected step is appended to the context. The threshold $\tau$ governs the compute–quality trade-off.
  • Figure 2: Arbitrage vs. baseline step-level SD approaches. Comparison of Reward-guided Speculative Decoding (RSD, top, which we use as a baseline) and our Arbitrage algorithm (bottom). RSD accepts or rejects draft steps using an absolute PRM reward threshold: when the PRM score of a draft-generated step falls below this threshold, the step is discarded and the target model is invoked to regenerate it. This absolute criterion can trigger unnecessary target regenerations (e.g., Step 4), where the target does not significantly improve the quality of the step. Arbitrage instead estimates the expected quality gain from escalating a step, i.e., invoking the larger target model to regenerate the step rather than keeping the draft step. It only calls the target when this predicted gain is positive, thereby avoiding wasted target calls (e.g., Steps 1 and 4).
  • Figure 3: Wasted target calls vs. deferral rate. In reward-based step-level speculation, the deferral rate (x-axis) is the fraction of reasoning steps that are escalated to the target model. The wasted deferral rate (y-axis) is the share of escalations in which the target’s step is no better than the draft’s (equal or lower PRM score), measured relative to the total number of steps. Wasted compute increases steadily with deferral rate, indicating many unnecessary target invocations under absolute-score rejection.
  • Figure 4: Arbitrage improves the compute–quality trade-off. Accuracy vs. acceptance rate for Arbitrage Oracle, Arbitrage Router, and RSD across two benchmarks (MATH500 and OlympiadBench) and three model configurations. The top row shows results on MATH500 and the bottom row on OlympiadBench. Columns (a), (b), and (c) correspond to LLaMA3 (1B/8B), LLaMA3 (8B/70B), and Qwen2.5-Math (3bit-7B/7B), respectively. In all cases, Arbitrage consistently yields higher accuracy at comparable acceptance rates, demonstrating superior compute–quality efficiency. Additional results are provided in Appendix \ref{app:additional_results}.
  • Figure 5: Arbitrage improves the compute–quality trade-off. Accuracy–time curves for Arbitrage Router and RSD on two LLaMA3 routing configurations. Subplot (a) reports results for a quantized-draft / full-precision-target setting (Q4-bit-8B/8B/1.5B) on MATH500, and subplot (b) for a small-draft / large-target setting (1B/8B/1.5B) on OlympiadBench. Across both configurations, Arbitrage Router consistently achieves higher accuracy at a given wall-clock time than RSD, yielding a better Pareto frontier. Each marker corresponds to a different threshold operating point; moving right indicates increased target-model invocations (and thus higher latency).
  • ...and 3 more figures
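
The two quantities plotted in Figure 3 can be computed from per-step PRM scores as follows. This is a sketch under stated assumptions: the score lists and the reward threshold of 0.65 are invented for illustration, and the helper name `deferral_stats` is hypothetical. The absolute rule escalates when the draft's score falls below the threshold, as in the RSD description in Figure 2; the advantage rule escalates only when the target's step scores strictly higher.

```python
from typing import List, Tuple


def deferral_stats(
    deferred: List[bool], draft_scores: List[float], target_scores: List[float]
) -> Tuple[float, float]:
    """Return (deferral rate, wasted deferral rate), both relative to all steps."""
    n = len(draft_scores)
    # An escalation is wasted when the target's step is no better than the draft's.
    wasted = [
        df and t <= d for df, d, t in zip(deferred, draft_scores, target_scores)
    ]
    return sum(deferred) / n, sum(wasted) / n


# Hypothetical PRM scores for five reasoning steps (not from the paper).
draft_scores = [0.9, 0.4, 0.7, 0.2, 0.6]
target_scores = [0.95, 0.35, 0.9, 0.8, 0.6]

# Absolute-score rule: escalate whenever the draft score is below a fixed threshold.
reward_threshold = 0.65
absolute_rule = [d < reward_threshold for d in draft_scores]
abs_deferral, abs_wasted = deferral_stats(absolute_rule, draft_scores, target_scores)

# Advantage rule with perfect knowledge: escalate only when the target is strictly better.
advantage_rule = [t > d for d, t in zip(draft_scores, target_scores)]
adv_deferral, adv_wasted = deferral_stats(advantage_rule, draft_scores, target_scores)
```

On this toy data both rules escalate three of five steps, but two of the absolute rule's escalations are wasted while none of the advantage rule's are, which mirrors the motivation for routing on relative rather than absolute quality.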

Theorems & Definitions (6)

  • proof
  • Corollary B.2: Pareto Frontier
  • proof
  • Corollary B.3: Zero Wasted Computation under Unconstrained Oracle
  • proof
  • Remark B.4: Arbitrage Oracle Assumption