Re-FORC: Adaptive Reward Prediction for Efficient Chain-of-Thought Reasoning

Renos Zabounidis, Aditya Golatkar, Michael Kleinman, Alessandro Achille, Wei Xia, Stefano Soatto

TL;DR

Re-FORC introduces an adaptive reward predictor that estimates the marginal value of additional thinking tokens when reasoning with LLMs. It models the forecasted reward as a function of token budget using a Beta-distributed forecaster, which enables a Gittins-index-style decision policy to terminate, switch models, or extend reasoning under a compute constraint. The approach supports three inference-time applications: early stopping (26% less compute at maintained accuracy), compute-aware model selection (4% higher accuracy at equal compute, or 55% less compute at equal accuracy, relative to the largest model), and test-time scaling (accuracy gains of 11% in the high-compute regime and 7% in the low-compute regime), evaluated across five math evaluation datasets. The framework gives an upfront estimate of computation time and lets users control the latency-accuracy trade-off without retraining the base model. Overall, Re-FORC advances efficient, adaptive reasoning by combining continuous reward forecasting with decision policies grounded in metareasoning theory.
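To make the idea of a Beta-distributed forecast over token budgets concrete, here is a minimal Python sketch. It assumes the forecaster emits Beta shape parameters for each candidate number of extra thinking tokens and that the forecasted reward is the Beta mean. The names `BetaForecast`, `net_value`, and `best_budget`, and the rule of subtracting a linear token cost, are illustrative assumptions and not the paper's actual interface or policy.

```python
from dataclasses import dataclass


@dataclass
class BetaForecast:
    """Hypothetical forecaster output for one candidate token budget."""
    extra_tokens: int  # additional thinking tokens under consideration
    alpha: float       # Beta shape parameter (> 0)
    beta: float        # Beta shape parameter (> 0)

    @property
    def mean(self) -> float:
        # Mean of Beta(alpha, beta), used here as the forecasted expected reward.
        return self.alpha / (self.alpha + self.beta)


def net_value(forecast: BetaForecast, cost_per_token: float) -> float:
    """Forecasted reward minus an assumed linear cost for the extra tokens."""
    return forecast.mean - cost_per_token * forecast.extra_tokens


def best_budget(forecasts, cost_per_token):
    """Return the candidate with the highest net value (first one wins ties)."""
    return max(forecasts, key=lambda f: net_value(f, cost_per_token))


if __name__ == "__main__":
    candidates = [
        BetaForecast(extra_tokens=0, alpha=2.0, beta=2.0),
        BetaForecast(extra_tokens=1000, alpha=6.0, beta=2.0),
        BetaForecast(extra_tokens=4000, alpha=9.0, beta=1.0),
    ]
    for lam in (1e-5, 1e-4, 1e-3):
        choice = best_budget(candidates, lam)
        print(f"cost/token={lam:g}: think for {choice.extra_tokens} more tokens")
```

With these made-up numbers, a low cost per token favors the longest budget, while a high cost favors stopping immediately, which is the qualitative behavior the summary describes.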

Abstract

We propose Re-FORC, an adaptive reward prediction method that, given a context, enables prediction of the expected future rewards as a function of the number of future thinking tokens. Re-FORC trains a lightweight adapter on reasoning models, demonstrating improved prediction with longer reasoning and larger models. Re-FORC enables: 1) early stopping of unpromising reasoning chains, reducing compute by 26% while maintaining accuracy; 2) optimized model and thinking length selection that achieves 4% higher accuracy at equal compute and 55% less compute at equal accuracy compared to the largest model; and 3) adaptive test-time scaling, which increases accuracy by 11% in the high-compute regime and 7% in the low-compute regime. Re-FORC allows dynamic reasoning with length control via cost-per-token thresholds while estimating computation time upfront.
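The length control via cost-per-token thresholds mentioned above can be sketched as a simple loop. This is one plausible stopping rule under stated assumptions, not necessarily the paper's exact policy: `forecast(used, extra)` is a hypothetical callable returning the predicted reward after `extra` more tokens given `used` tokens so far, and reasoning stops once no candidate extension is forecast to gain more reward per token than the threshold.

```python
from typing import Callable, Sequence


def should_stop(
    reward_now: float,
    future_rewards: Sequence[float],
    future_tokens: Sequence[int],
    cost_per_token: float,
) -> bool:
    """Stop when no extension gains more reward per token than it costs."""
    best_rate = max(
        (reward - reward_now) / tokens
        for reward, tokens in zip(future_rewards, future_tokens)
    )
    return best_rate <= cost_per_token


def run_with_early_stopping(
    forecast: Callable[[int, int], float],  # hypothetical forecaster interface
    chunk: int,                             # tokens generated between checks
    max_tokens: int,                        # hard cap on thinking tokens
    horizons: Sequence[int],                # candidate extensions, all > 0
    cost_per_token: float,                  # user-chosen threshold
) -> int:
    """Return how many thinking tokens are spent before stopping."""
    used = 0
    while used < max_tokens:
        now = forecast(used, 0)
        future = [forecast(used, h) for h in horizons]
        if should_stop(now, future, horizons, cost_per_token):
            break
        used += chunk  # stand-in for generating another chunk of reasoning
    return used


if __name__ == "__main__":
    # Toy forecaster (an assumption): reward grows linearly, then saturates.
    def toy(used: int, extra: int) -> float:
        return min(0.8, 0.2 + 1e-4 * (used + extra))

    for lam in (5e-5, 2e-4):
        spent = run_with_early_stopping(toy, 500, 10_000, [500], lam)
        print(f"threshold={lam:g}: stopped after {spent} tokens")
```

Raising the threshold makes the loop stop earlier, so a single scalar gives the user direct control over the latency-accuracy trade-off.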

Paper Structure

This paper contains 22 sections, 13 equations, 14 figures.

Figures (14)

  • Figure 1: Forecast performance with reasoning progress. Agreement between the Re-FORC forecast (Eq. \ref{eq:forecaster-beta-mean}) and the true reward (Eq. \ref{eq:expected-reward}) as CoT progresses for Qwen3 \cite{yang2025qwen3} models, averaged across five evaluation datasets (see \ref{ref:results}). (Left) Pearson correlation ($\rho$) (higher is better), (middle) mean squared error (MSE) (lower is better), and (right) mean absolute error (MAE) (lower is better). Performance steadily improves with reasoning length, and larger models (e.g. 8B) achieve higher $\rho$ and lower error throughout.
  • Figure 2: Early stopping with Re-FORC. We plot the accuracy vs. tokens trade-off for various Qwen3 models: 1.7B (left), 4B (middle), and 8B (right), averaged across five reasoning datasets (see \ref{ref:results}). Re-FORC improves the trade-off frontier over baseline inference-time scaling methods such as S1 \cite{muennighoff2025s1} across all model sizes and benchmarks. Results for individual datasets are in Appendix Fig. \ref{fig:allearlystopping}.
  • Figure 3: Model and thinking length selection with Re-FORC. (Left) We plot the accuracy-compute trade-off for Qwen3 models averaged across five reasoning benchmarks. We use the method proposed in Eq. \ref{eq:model-selection} and Eq. \ref{eq:model-selection-size} to perform routing with model selection, and compare against baselines. The accuracy-compute frontier obtained by Re-FORC outperforms the individual models, including the Avg-of-N baseline. (Right) We plot the sampling frequency of Re-FORC Pandora for different model sizes as a function of compute. The sampling frequency is the percentage of questions for which Re-FORC chooses to sample a given model. At maximum compute budgets, Re-FORC samples from all the models (from small to large), while at the minimum compute budget Re-FORC preferentially samples from the smallest 1.7B model. Re-FORC routes a majority of the queries to the smaller model (especially in the low-compute region) and starts routing more to the larger models only in the high-compute range. Results for individual datasets are in Appendix Fig. \ref{fig:model-selection-per-ds}.
  • Figure 4: Test-time scaling using Re-FORC. We plot the accuracy-token trade-off for Re-FORC algorithms based on Pandora's box greedy search, for Qwen3 reasoning models averaged across three reasoning benchmarks: AIME 24/25 and AMC24. We also compare our method against the repeated-sampling test-time scaling baselines Avg-of-k and majority vote. Re-FORC-based scaling outperforms these baselines and is outperformed only by Pass-at-k, which requires oracle access to the true solution. Our Re-FORC Pandora method in Eq. \ref{eq:test-time-scaling} outperforms the model-selection-based procedure in Eq. \ref{eq:model-selection} (purple) in the low-compute regime, which illustrates the benefit of switching between multiple trajectories during inference. Note that we sample $32\times$ for each query during inference for all the methods. Results for individual datasets are in Appendix Fig. \ref{fig:alltts}.
  • Figure 5: Token distribution and problem difficulty for Qwen3 models averaged across datasets. Problems are ordered by per-model solve rate across datasets (Minerva, MATH500, AMC2024, AIME2024, AIME2025); “easy” problems are solved in $\geq$90% of trials and “hard” ones in $<$50%. Each curve shows the cumulative share of total tokens spent versus cumulative problem difficulty, with the dashed diagonal indicating uniform allocation. Increasing $\lambda$ represents a higher cost of time, encouraging more selective compute use. At a high cost of time ($\lambda{=}4.0{\times}10^{-4}$, purple), the small 1.7B model allocates proportionally more tokens to easier problems while largely avoiding the hardest ones, an economically efficient strategy given its limited capability. In contrast, larger models (4B, 8B) can still benefit from deeper reasoning on difficult problems and thus concentrate their compute on the hardest $\sim$20% of tasks while conserving effort on easy cases.
  • ...and 9 more figures
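
The three forecast-quality metrics named in the Figure 1 caption (Pearson correlation, MSE, MAE) can be computed as in the following minimal sketch. The function name `forecast_metrics` and the toy numbers are illustrative assumptions; the sketch uses the standard definitions of the metrics and is not taken from the paper's evaluation code.

```python
import math


def forecast_metrics(predicted, actual):
    """Pearson correlation, MSE, and MAE between forecasts and true rewards."""
    n = len(predicted)
    mean_p = sum(predicted) / n
    mean_a = sum(actual) / n
    # Unnormalized covariance and variances; the 1/n factors cancel in rho.
    cov = sum((p - mean_p) * (a - mean_a) for p, a in zip(predicted, actual))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_a = sum((a - mean_a) ** 2 for a in actual)
    pearson = cov / math.sqrt(var_p * var_a)
    mse = sum((p - a) ** 2 for p, a in zip(predicted, actual)) / n
    mae = sum(abs(p - a) for p, a in zip(predicted, actual)) / n
    return {"pearson": pearson, "mse": mse, "mae": mae}


if __name__ == "__main__":
    # Made-up forecasts and true rewards for three contexts.
    forecasts = [0.1, 0.5, 0.9]
    rewards = [0.2, 0.6, 1.0]
    print(forecast_metrics(forecasts, rewards))
```

Note that a forecaster can have high correlation and still be biased: in the toy data every forecast is off by the same amount, so the correlation is perfect while MSE and MAE are nonzero, which is why the caption reports all three.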