Outcome-based Reinforcement Learning to Predict the Future

Benjamin Turtel, Danny Franklin, Kris Skotheim, Luke Hewitt, Philipp Schoenegger

TL;DR

This work extends Reinforcement Learning with Verifiable Rewards (RLVR) to forecasting real-world events, using a stable online RL pipeline to address the instability caused by noisy, delayed outcomes. The authors compare several on-policy algorithms (GRPO variants, ReMax) and DPO on a Polymarket dataset that includes 100k synthetic questions. They show that a 7-run ReMax ensemble matches or surpasses the predictive accuracy of frontier models like o1, as measured by Brier score, while improving probabilistic calibration. A hypothetical trading evaluation indicates practical value, with an estimated ROI of over 10% and notable gains on low-market-confidence questions, which points to the potential of calibrated RL forecasters as decision-support tools. The broader-impacts discussion covers societal risks and the importance of interpretability and human oversight when such models are applied to high-stakes forecasting and financial domains.

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has been an effective approach for improving Large Language Models' reasoning in domains such as coding and mathematics. Here, we apply RLVR methods to forecasting future real-world events, a challenging task for RL because the outcomes involved are very noisy and delayed. Using a novel dataset of recent questions from a prediction market, together with relevant news headlines, we show that a compact (14B) reasoning model can be trained to match or surpass the predictive accuracy of frontier models like o1, while greatly improving probabilistic calibration. The model's performance is also practically meaningful: in a Polymarket trading simulation, we estimate that its bets would have yielded a return on investment of over 10% across all questions in the test set. We detail and compare approaches used in training our model, including augmenting our training data with synthetic prediction questions, guardrails for learning stability, and median prediction sampling at inference time.

Paper Structure

This paper contains 22 sections, 5 equations, 3 figures, 1 table.

Figures (3)

  • Figure 1: Mean soft-Brier score (accuracy, left) and mean expected calibration error (ECE, right) for each training algorithm on the Polymarket hold-out set. Error bars show $95\%$ confidence intervals. Lower values are better on both axes.
  • Figure 2: Left: cumulative realised profit (USD) as each model sequentially places one-share trades, ranked by ex-ante expected edge. Solid disks mark the last trade with Edge > ECE; open rings mark the last trade with Edge > 0. Right: total profit under the three bet-selection rules. Truncating the strategy at the calibration threshold (Edge > ECE) retains almost the entire upside while avoiding the loss-making tail.
  • Figure 3: Win probability of simulated bets placed by ReMax, Ensemble 7, by market price.
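
The quantities named in the captions above can be illustrated with a minimal Python sketch. It is not the authors' code, and several choices are assumptions made for illustration: the standard Brier score stands in for the paper's soft-Brier variant (whose definition is not given in this section), ECE is computed with ten equal-width bins, the expected edge of a one-share trade is taken to be the absolute gap between the model probability and the market price, and fees and slippage are ignored. The function names `brier_score`, `expected_calibration_error`, and `select_trades` are hypothetical.

```python
from typing import List, Sequence, Tuple


def brier_score(probs: Sequence[float], outcomes: Sequence[int]) -> float:
    """Standard Brier score: mean squared error between forecasts and 0/1 outcomes."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)


def expected_calibration_error(
    probs: Sequence[float], outcomes: Sequence[int], n_bins: int = 10
) -> float:
    """Binned ECE: weighted mean gap between average forecast and observed frequency.

    Equal-width bins are an assumption; the paper may bin differently.
    """
    bins: List[List[Tuple[float, int]]] = [[] for _ in range(n_bins)]
    for p, y in zip(probs, outcomes):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((p, y))

    total = len(probs)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_forecast = sum(p for p, _ in members) / len(members)
        observed_freq = sum(y for _, y in members) / len(members)
        ece += (len(members) / total) * abs(mean_forecast - observed_freq)
    return ece


def select_trades(
    probs: Sequence[float],
    prices: Sequence[float],
    outcomes: Sequence[int],
    threshold: float,
) -> List[Tuple[float, float]]:
    """Return (edge, realised_profit) pairs for trades whose edge exceeds threshold.

    Assumed trade model: buy one YES share at the market price when the model
    probability is above the price, otherwise buy one NO share at (1 - price).
    Trades are ranked by ex-ante edge, largest first.
    """
    trades = []
    for p, m, y in zip(probs, prices, outcomes):
        edge = abs(p - m)
        if edge <= threshold:
            continue
        profit = (y - m) if p > m else (m - y)
        trades.append((edge, profit))
    trades.sort(key=lambda t: t[0], reverse=True)
    return trades


if __name__ == "__main__":
    # Toy inputs, invented purely for illustration.
    probs = [0.9, 0.2, 0.6, 0.4]
    prices = [0.7, 0.3, 0.55, 0.5]
    outcomes = [1, 0, 1, 0]

    ece = expected_calibration_error(probs, outcomes)
    print("Brier:", brier_score(probs, outcomes))
    print("ECE:", ece)
    print("Edge > 0:", select_trades(probs, prices, outcomes, 0.0))
    print("Edge > ECE:", select_trades(probs, prices, outcomes, ece))
```

Passing the model's own ECE as `threshold` corresponds to the Edge > ECE rule in Figure 2, and passing `0.0` corresponds to the Edge > 0 rule. A cumulative sum over the returned profits gives a curve of the kind described in the left panel.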