LLMs Can Teach Themselves to Better Predict the Future

Benjamin Turtel, Danny Franklin, Philipp Schoenegger

TL;DR

The paper tackles LLM-based forecasting by removing dependence on human-curated reasoning samples and instead learning from resolved outcomes through self-play. It introduces a six-step pipeline that combines data collection from prediction markets, news augmentation, synthetic self-generated reasoning with probabilistic forecasts, resolution-driven re-ranking by closeness to outcomes, and Direct Preference Optimization fine-tuning with LoRA adapters on 4-bit quantized bases. Empirical results on Phi-4 14B and DeepSeek-R1 14B show 7–10% improvements in forecasting accuracy over base and randomized-control models, with performance approaching that of GPT-4o, and statistically robust improvements confirmed by BH-corrected tests. The work demonstrates that self-play-derived reasoning data can robustly enhance probabilistic forecasting in LLMs, offering a scalable alternative to human-annotated data and a potential path to frontier-level capabilities in real-world forecasting tasks, even under resource-efficient settings. Key metrics include the Brier score $BS = \frac{1}{N}\sum_{i=1}^{N} (p_i - o_i)^2$, where $p_i$ is the forecast probability and $o_i \in \{0, 1\}$ the resolved outcome of question $i$, and the ranking metric $r(p,o)=|p-o|$; these anchor the evaluation and the training signal, respectively.
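The two metrics above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the `(reasoning, probability)` tuple layout, the `chosen`/`rejected` dictionary keys, and the decision to skip tied pairs are all assumptions made for illustration. It computes the Brier score over a set of forecasts and uses $r(p,o)=|p-o|$ to decide which of two self-generated forecasts for the same question would be treated as preferred.

```python
from itertools import combinations


def brier_score(probs, outcomes):
    """Mean squared difference between forecast probabilities and 0/1 outcomes."""
    if not probs or len(probs) != len(outcomes):
        raise ValueError("probs and outcomes must be non-empty and of equal length")
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)


def rank_distance(p, o):
    """Ranking metric r(p, o) = |p - o|; smaller means closer to the outcome."""
    return abs(p - o)


def build_preference_pairs(forecasts, outcome):
    """Turn forecasts for one resolved question into preference pairs.

    `forecasts` is a list of (reasoning_text, probability) tuples
    (hypothetical layout). In each pair, the forecast closer to the
    resolved outcome is marked as chosen.
    """
    pairs = []
    for a, b in combinations(forecasts, 2):
        dist_a = rank_distance(a[1], outcome)
        dist_b = rank_distance(b[1], outcome)
        if dist_a == dist_b:
            # Assumption: a tie carries no preference signal, so skip it.
            continue
        chosen, rejected = (a, b) if dist_a < dist_b else (b, a)
        pairs.append({"chosen": chosen[0], "rejected": rejected[0]})
    return pairs


if __name__ == "__main__":
    print(brier_score([0.9, 0.2], [1, 0]))
    samples = [("trace A", 0.8), ("trace B", 0.3)]
    print(build_preference_pairs(samples, outcome=1))
```

For a question that resolved to 1, the forecast of 0.8 is closer to the outcome than the forecast of 0.3, so the sketch marks "trace A" as chosen and "trace B" as rejected. Pairs in this shape are what a preference-optimization step such as DPO would take as input.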

Abstract

We present an outcome-driven fine-tuning framework that enhances the forecasting capabilities of large language models (LLMs) without relying on human-curated reasoning samples. Our method leverages model self-play to generate pairs of diverse reasoning trajectories and probabilistic forecasts for a set of diverse questions that resolve after the models' knowledge cutoff date. We then rank pairs of these reasoning traces by their distance to the actual outcomes before fine-tuning the model via Direct Preference Optimization (DPO). On a separate test set, our approach increases prediction accuracy of Phi-4 14B and DeepSeek-R1 14B by 7–10% over a base model and a DPO fine-tuned control model with randomized labels, bringing them on par with forecasting capabilities of much larger frontier models like GPT-4o.

Paper Structure

This paper contains 10 sections, 3 equations, 5 figures, 3 tables.

Figures (5)

  • Figure 1: Overview Flowchart
  • Figure 2: Accuracy Results for all Models
  • Figure 3: Per-Epoch Accuracy
  • Figure 4: Forecasting Prompts by Model
  • Figure 5: Ridge Plot of Forecasting Accuracy for each Model