ZIP-RC: Optimizing Test-Time Compute via Zero-Overhead Joint Reward-Cost Prediction

Rohin Manvi, Joey Hong, Tim Seyde, Maxime Labonne, Mathias Lechner, Sergey Levine

TL;DR

ZIP-RC addresses the lack of introspective reasoning in LLMs by providing zero-overhead, inference-time predictions of the joint reward and cost, enabling adaptive test-time compute via a principled sampling utility. It repurposes reserved logits to output a joint reward-cost distribution and uses this to guide a meta-MDP-based sampling strategy that optimizes compute, latency, and accuracy in real time. Experiments on mixed-difficulty mathematical benchmarks show up to 12% absolute accuracy gains at equal or lower cost and reveal smooth Pareto frontiers between quality, compute, and latency, demonstrating adaptive, interpretable, and efficient inference. This work lays the groundwork for more autonomous, resource-aware LLMs that can allocate compute adaptively during decoding.

Abstract

Large language models excel at reasoning but lack key aspects of introspection, including anticipating their own success and the computation required to achieve it. Humans use real-time introspection to decide how much effort to invest, when to make multiple attempts, when to stop, and when to signal success or failure. Without this, LLMs struggle to make intelligent metacognitive decisions. Test-time scaling methods like Best-of-N drive up cost and latency by using a fixed budget of samples regardless of the marginal benefit of each one at any point in generation, and the absence of confidence signals can mislead people, prevent appropriate escalation to better tools, and undermine trustworthiness. Learned verifiers or reward models can provide confidence estimates, but they do not enable adaptive inference and add substantial cost by requiring extra models or forward passes. We present ZIP-RC, an adaptive inference method that equips models with zero-overhead inference-time predictions of reward and cost. At every token, ZIP-RC reuses reserved or unused logits in the same forward pass as next-token prediction to output a joint distribution over final reward and remaining length, with no extra models, architecture changes, or inference overhead. This full joint distribution is used to compute a sampling utility, a linear combination of the expected maximum reward, total compute, and latency of a set of samples if generated to completion. During inference, we maximize this utility with meta-actions that determine which prefix of tokens to continue or initiate sampling from. On mixed-difficulty mathematical benchmarks, ZIP-RC improves accuracy by up to 12% over majority voting at equal or lower average cost, and traces smooth Pareto frontiers between quality, compute, and latency. By providing real-time reward-cost introspection, ZIP-RC enables adaptive, efficient reasoning.
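The sampling utility described in the abstract can be illustrated with a small sketch. This is not the paper's implementation: it assumes the samples are independent, that latency is the expected longest remaining length among samples generated in parallel, and that the linear combination takes the form `E[max reward] - beta * (alpha * compute + (1 - alpha) * latency)`. The function names, the grid layout `joint[reward_bin][length_bin]`, and this exact weighting are all hypothetical choices made for illustration.

```python
from itertools import accumulate
from math import prod


def marginal(joint, axis):
    """Marginalize a grid joint[reward_bin][length_bin] (assumed layout)."""
    if axis == 0:  # marginal over reward bins
        return [sum(row) for row in joint]
    return [sum(col) for col in zip(*joint)]  # marginal over length bins


def expected_max(pmfs, values):
    """E[max_i X_i] for independent X_i on a shared ascending support."""
    cdfs = [list(accumulate(p)) for p in pmfs]
    # CDF of the maximum is the product of the individual CDFs.
    max_cdf = [prod(c[k] for c in cdfs) for k in range(len(values))]
    total, prev = 0.0, 0.0
    for v, c in zip(values, max_cdf):
        total += v * (c - prev)
        prev = c
    return total


def sampling_utility(joints, reward_values, length_values, alpha, beta):
    """Hypothetical utility: reward minus a weighted compute/latency cost."""
    reward_pmfs = [marginal(j, 0) for j in joints]
    length_pmfs = [marginal(j, 1) for j in joints]
    exp_max_reward = expected_max(reward_pmfs, reward_values)
    # Total compute: sum of expected remaining lengths over all samples.
    compute = sum(
        sum(v * p for v, p in zip(length_values, pmf)) for pmf in length_pmfs
    )
    # Latency: expected longest remaining length (assumes parallel decoding).
    latency = expected_max(length_pmfs, length_values)
    return exp_max_reward - beta * (alpha * compute + (1 - alpha) * latency)


# Two identical samples with a uniform joint over 2 reward x 2 length bins.
uniform = [[0.25, 0.25], [0.25, 0.25]]
u_one = sampling_utility([uniform], [0.0, 1.0], [100, 200], alpha=1.0, beta=0.001)
u_two = sampling_utility([uniform] * 2, [0.0, 1.0], [100, 200], alpha=1.0, beta=0.001)
```

Under these assumptions, adding a second sample raises the expected maximum reward but also the expected compute, so whether it is worth starting depends on the chosen weights; comparing such utilities across candidate meta-actions is the kind of decision the abstract describes.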

Paper Structure

This paper contains 38 sections, 1 theorem, 24 equations, 6 figures, 3 tables.

Key Result

Theorem 5.1

At every timestep $t$, our strategy $\mu^{\text{ZIP-RC}}$ performs better than any predefined strategy $\mu \in M_t$. Namely, for any meta-state $S_t$, we have

Figures (6)

  • Figure 1: Top left shows how ZIP repurposes reserved or unused logits in the output head of a language model to instantiate auxiliary predictions, such as the grid mapping for the joint reward-cost distribution that ZIP-RC uses. Top right demonstrates how ZIP-RC can provide real-time expected reward and remaining length predictions. Finally, the bottom shows the joint distributions from ZIP-RC and how they indicate optimal sampling strategies. ZIP-RC sampling uses these joint distributions to calculate a sampling utility to autonomously select meta-actions for optimal test-time compute allocation.
  • Figure 2: Predictions and ground truth for the initial joint distributions of 10 questions randomly sampled from the AMC 2023 benchmark and 10 questions from the AIME 2024 benchmark. The ground truth for each prompt was estimated with 256 rollouts from Qwen3-1.7B, and predictions were made using ZIP-RC trained with the same model. This shows that the joint distribution from ZIP-RC is calibrated and relatively accurate in forecasting the outcomes of its own rollouts.
  • Figure 3: Performance of ZIP-RC sampling and baselines across all models and benchmarks. The top half demonstrates the latency bound setting where $\alpha = 0.1$, and the bottom half demonstrates the compute bound setting where $\alpha = 1.0$. Adjusting $\beta$ in ZIP-RC sampling allows it to trade generation cost for higher performance (similar to increasing $N$ in BoN) while adjusting $\alpha$ allows it to adjust the prioritization of compute and latency. By navigating the Pareto frontier and allocating generation cost adaptively, ZIP-RC sampling significantly outperforms majority voting and other baselines.
  • Figure 4: KL divergence from the original policy during training of ZIP-RC with and without the KL term. Using $\alpha_{\mathrm{KL}} = 10$ keeps the KL nearly zero throughout training, stabilizing around $0.005$. Without the KL term, the policy eventually changes, emphasizing the importance and effectiveness of this component of the ZIP objective. We used the same training data as in our main experiments.
  • Figure 5: Similar to the demonstration of ZIP-RC's joint distribution prediction in Figure 2, we visualize the joint distribution predictions from ZIP-RC-Lite and compare them with ground truth estimates. While the predictions correlate with the ground truth, ZIP-RC-Lite tends to produce more similar-looking distributions across prompts and overestimates variance compared to ZIP-RC.
  • ...and 1 more figure
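
The Figure 1 caption describes reserved or unused logits being mapped onto a grid for the joint reward-cost distribution. A minimal sketch of that readout follows, under assumptions not stated in the source: the reserved logits are normalized with a softmax over the reserved positions only, and the grid is laid out row-major with reward bins as rows and remaining-length bins as columns. The function names, the `reserved_ids` argument, and the bin values are hypothetical.

```python
import math


def joint_from_reserved_logits(logits, reserved_ids, n_reward_bins, n_length_bins):
    """Softmax over the reserved-token logits, reshaped to reward x length."""
    assert len(reserved_ids) == n_reward_bins * n_length_bins
    selected = [logits[i] for i in reserved_ids]
    m = max(selected)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in selected]
    z = sum(exps)
    probs = [e / z for e in exps]
    # Assumed row-major layout: one row per reward bin.
    return [
        probs[r * n_length_bins:(r + 1) * n_length_bins]
        for r in range(n_reward_bins)
    ]


def expected_reward_and_length(joint, reward_values, length_values):
    """Marginal expectations read off the joint grid."""
    exp_reward = sum(rv * sum(row) for rv, row in zip(reward_values, joint))
    exp_length = sum(lv * sum(col) for lv, col in zip(length_values, zip(*joint)))
    return exp_reward, exp_length


# Toy vocabulary of 10 logits whose last 4 positions are treated as reserved.
toy_logits = [0.0] * 10
grid = joint_from_reserved_logits(toy_logits, [6, 7, 8, 9], 2, 2)
exp_r, exp_l = expected_reward_and_length(grid, [0.0, 1.0], [100, 200])
```

Because the readout only slices logits that the forward pass already produces, a sketch like this adds no extra model call, which is consistent with the zero-overhead claim in the abstract.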

Theorems & Definitions (2)

  • Theorem 5.1
  • proof