Re-FORC: Adaptive Reward Prediction for Efficient Chain-of-Thought Reasoning
Renos Zabounidis, Aditya Golatkar, Michael Kleinman, Alessandro Achille, Wei Xia, Stefano Soatto
TL;DR
Re-FORC introduces an adaptive reward predictor that estimates the marginal value of additional thinking tokens for LLM reasoning. It models the forecasted reward as a function of token budget with a Beta-distributed forecaster, which enables a Gittins-index-style decision policy to terminate reasoning, switch models, or extend reasoning under a compute constraint. The approach supports three inference-time applications (early stopping, compute-aware model selection, and test-time scaling) and yields compute savings and accuracy gains across five math evaluation datasets. The framework provides a principled, upfront estimate of computation time and gives users control over the latency-accuracy trade-off without retraining the base model. Overall, Re-FORC advances efficient, adaptive reasoning by combining continuous reward forecasting with metareasoning-theoretic decision policies.
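To make the stopping decision concrete, the following is a minimal sketch, not the paper's actual policy. It assumes the forecaster returns Beta parameters for the reward at a few candidate budgets, and it replaces the Gittins-index-style rule with a simpler check: continue only if some budget has a forecasted gain that exceeds its token cost. The names `BetaForecast`, `should_continue`, and `cost_per_token`, and all numeric values, are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class BetaForecast:
    """Hypothetical forecaster output: a Beta(alpha, beta) distribution over
    the reward expected after `extra_tokens` more thinking tokens."""
    extra_tokens: int
    alpha: float
    beta: float

    @property
    def mean(self) -> float:
        # Mean of a Beta(alpha, beta) distribution.
        return self.alpha / (self.alpha + self.beta)


def should_continue(current_reward, forecasts, cost_per_token):
    """Return (continue_flag, best_extra_tokens).

    Simplified stand-in for an index policy: pick the budget with the
    largest net value (forecasted gain minus token cost) and continue
    only if that net value is positive.
    """
    best_net, best_tokens = 0.0, 0
    for f in forecasts:
        net = (f.mean - current_reward) - cost_per_token * f.extra_tokens
        if net > best_net:
            best_net, best_tokens = net, f.extra_tokens
    return best_net > 0.0, best_tokens


# Illustrative values only, not results from the paper.
forecasts = [BetaForecast(512, 6.0, 4.0), BetaForecast(2048, 9.0, 1.0)]
cheap = should_continue(0.5, forecasts, cost_per_token=1e-4)
costly = should_continue(0.5, forecasts, cost_per_token=1e-3)
```

With a low cost per token the sketch chooses to keep thinking for the larger budget; with a cost ten times higher, no budget pays for itself and the chain stops.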
Abstract
We propose Re-FORC, an adaptive reward prediction method that, given a context, predicts the expected future reward as a function of the number of future thinking tokens. Re-FORC trains a lightweight adapter on reasoning models, and its predictions improve with longer reasoning and larger models. Re-FORC enables: 1) early stopping of unpromising reasoning chains, reducing compute by 26% while maintaining accuracy; 2) optimized model and thinking-length selection that achieves 4% higher accuracy at equal compute and 55% less compute at equal accuracy compared to the largest model; and 3) adaptive test-time scaling, which increases accuracy by 11% in the high-compute regime and 7% in the low-compute regime. Re-FORC allows dynamic reasoning with length control via cost-per-token thresholds while estimating computation time upfront.
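The joint choice of model and thinking length can be illustrated with a small sketch. It assumes, hypothetically, that predicted rewards are available for each (model, token budget) pair and that each model has a relative per-token cost. The selection then maximizes predicted reward minus a user-set cost weight times total cost. The function name, the dictionary layout, and every number below are illustrative assumptions, not the paper's implementation or results.

```python
import math


def select_model_and_budget(predicted_reward, relative_cost, cost_weight):
    """Pick the (model, budget) pair with the best reward/cost trade-off.

    predicted_reward: dict mapping (model_name, token_budget) -> expected reward
    relative_cost:    dict mapping model_name -> cost per token (arbitrary units)
    cost_weight:      user-chosen price of one cost unit, in reward units
    """
    best_pair, best_score = None, -math.inf
    for (model, budget), reward in predicted_reward.items():
        score = reward - cost_weight * relative_cost[model] * budget
        if score > best_score:
            best_pair, best_score = (model, budget), score
    return best_pair


# Made-up forecasts for two hypothetical models at two budgets.
predicted = {
    ("small", 1000): 0.60,
    ("small", 4000): 0.70,
    ("large", 1000): 0.75,
    ("large", 4000): 0.90,
}
cost = {"small": 1.0, "large": 4.0}

latency_tolerant = select_model_and_budget(predicted, cost, cost_weight=1e-5)
latency_sensitive = select_model_and_budget(predicted, cost, cost_weight=1e-4)
```

Raising `cost_weight` moves the choice from the large model with a long budget to the small model with a short one, which is the kind of user-controlled latency-accuracy trade-off the abstract describes.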
