Outcome-based Reinforcement Learning to Predict the Future
Benjamin Turtel, Danny Franklin, Kris Skotheim, Luke Hewitt, Philipp Schoenegger
TL;DR
This work extends Reinforcement Learning with Verifiable Rewards (RLVR) to forecasting real-world events, using a stable online RL pipeline to address the instability caused by noisy, delayed outcomes. The authors evaluate several on-policy algorithms (GRPO variants, ReMax) alongside DPO on a Polymarket dataset (including 100k synthetic questions) and show that a 7-run ReMax ensemble achieves competitive Brier scores and calibration, matching or surpassing frontier models like o1 in accuracy and improving on them in probabilistic reliability. A hypothetical trading evaluation indicates practical value: the model's bets yield an estimated ROI of over 10%, with notable gains on questions where market confidence is low, underscoring the potential of calibrated RL forecasters as decision-support tools. The broader-impacts discussion covers societal risks and the importance of interpretability and human oversight when applying such models to high-stakes forecasting and financial domains.
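For readers unfamiliar with the metric, the Brier score mentioned above is the mean squared error between forecast probabilities and binary outcomes, so lower is better and a constant forecast of 0.5 scores 0.25. The following is a minimal sketch of the standard definition; the function name `brier_score` is illustrative and is not taken from the authors' code.

```python
def brier_score(probs, outcomes):
    """Mean squared error between forecast probabilities and 0/1 outcomes.

    probs:    forecast probabilities that each event resolves "yes"
    outcomes: resolved outcomes, 1 for "yes" and 0 for "no"
    Lower is better; 0.0 is a perfect score.
    """
    if len(probs) != len(outcomes):
        raise ValueError("probs and outcomes must have the same length")
    if not probs:
        raise ValueError("at least one forecast is required")
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)


# An uninformative forecaster that always says 0.5 scores 0.25.
baseline = brier_score([0.5, 0.5], [1, 0])
# A confident, correct forecaster scores close to 0.
sharp = brier_score([0.9, 0.2], [1, 0])
```

Because the penalty is quadratic, confident wrong forecasts are punished heavily, which is why the score rewards calibration as well as accuracy.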
Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) has been an effective approach for improving Large Language Models' reasoning in domains such as coding and mathematics. Here, we apply RLVR methods to forecasting future real-world events, a challenging task for RL because the outcomes involved are very noisy and delayed. Using a novel dataset of recent questions from a prediction market, together with relevant news headlines, we show that a compact (14B) reasoning model can be trained to match or surpass the predictive accuracy of frontier models like o1, while greatly improving probabilistic calibration. The model's performance is also practically meaningful: in a Polymarket trading simulation, we estimate that its bets would have yielded a return on investment of over 10% across all questions in the test set. We detail and compare the approaches used in training our model, including augmenting our training data with synthetic prediction questions, guardrails for learning stability, and median prediction sampling at inference time.
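As a rough illustration of median prediction sampling at inference time, one could draw several forecasts for the same question and report their median, which is less sensitive to an occasional outlier sample than the mean. The sketch below assumes the sampled probabilities have already been parsed from model outputs; the function name `median_forecast` and the handling of unparseable samples (represented as `None`) are assumptions for illustration, not the authors' implementation.

```python
from statistics import median


def median_forecast(samples):
    """Aggregate several sampled probabilities for one question by their median.

    samples: probabilities parsed from independent model samples; entries may be
             None (hypothetical marker for a sample that failed to parse).
    Samples that are missing or outside [0, 1] are dropped before aggregating.
    """
    valid = [p for p in samples if p is not None and 0.0 <= p <= 1.0]
    if not valid:
        raise ValueError("no valid probability samples to aggregate")
    return median(valid)


# One outlier sample (0.9) barely moves the aggregated forecast.
forecast = median_forecast([0.2, 0.9, 0.3])
```

With an even number of valid samples, `statistics.median` returns the average of the two middle values.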
