Real-Time Progress Prediction in Reasoning Language Models

Hans Peter Lynsgøe Raaschou-jensen, Constanza Fierro, Anders Søgaard

TL;DR

The paper investigates real-time progress prediction in reasoning language models by discretizing progress into 10 bins and training linear probes, then introducing a two-stage fine-tuning workflow (SFT followed by RL) to generate progress estimates during inference. It demonstrates that hidden representations partially encode progress, achieving a mean absolute error around 0.10 for sequences up to 16K tokens, with notable degradation on longer or out-of-distribution sequences and a dispersion bound in progress estimates. A data-augmentation strategy with dedicated progress markers and a two-pronged training regime (SFT with noisy labels and RL) yields the strongest progress-prediction performance, while analyses reveal a tradeoff between progress accuracy and downstream reasoning performance. The work also proposes improvements such as custom prediction tokens and masking strategies to mitigate shortcuts, and suggests scaling to larger models and broader domains to enhance robustness and applicability in real-time monitoring of reasoning processes.

Abstract

Recent advances in reasoning language models -- particularly those that use long, latent chains of thought -- have demonstrated remarkable capabilities in complex, agentic tasks. However, as these models operate over increasingly extended time horizons, their internal progress becomes opaque to users, complicating expectation management and real-time oversight. In this work, we investigate whether real-time progress prediction is feasible. We discretize progress and train a linear probe to classify reasoning states. We then introduce a two-stage fine-tuning approach that enables reasoning models to generate progress estimates (0→100%) during inference. Our best fine-tuned model achieves an average error of 10% for sequences shorter than 16,000 tokens, offering a practical mechanism for monitoring and interpreting model reasoning in real time.
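
As a rough illustration of the discretization and error metric described above, the following Python sketch maps token positions to progress bins and scores predictions by mean absolute error. The function names, the use of 10 equal-width bins over relative token position, and the bin-midpoint expectation are assumptions made for illustration, not the authors' implementation.

```python
from typing import Sequence


def progress_bin(position: int, total_length: int, num_bins: int = 10) -> int:
    """Map a 0-based token position to a progress bin in [0, num_bins - 1].

    Assumes progress is the relative position within the full reasoning trace.
    """
    if total_length <= 0 or not 0 <= position < total_length:
        raise ValueError("position must lie inside the trace")
    return min(position * num_bins // total_length, num_bins - 1)


def expected_progress(probs: Sequence[float], num_bins: int = 10) -> float:
    """Collapse a probability distribution over bins into a scalar in [0, 1].

    Uses bin midpoints as representative values (an assumption).
    """
    if len(probs) != num_bins:
        raise ValueError("need one probability per bin")
    centers = [(i + 0.5) / num_bins for i in range(num_bins)]
    return sum(p * c for p, c in zip(probs, centers))


def mean_absolute_error(predicted: Sequence[float], actual: Sequence[float]) -> float:
    """Average absolute gap between predicted and true progress."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(predicted)
```

Under these assumptions, a probe that outputs a distribution over bins can be compared with the true relative position by passing `expected_progress(probs)` and `position / total_length` to `mean_absolute_error`.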

Paper Structure

This paper contains 37 sections, 12 equations, 11 figures, 8 tables.

Figures (11)

  • Figure 1: Real-time progress tracking during LLM reasoning. The model intermittently updates the user-facing progress bar by emitting <progressbar>..</progressbar> tags.
  • Figure 2: Heatmap of probability mass and expected value over the quantiles. We see that the expected value closely follows the diagonal, indicating well-calibrated progress estimates. A similar trend can be observed in the heatmap where most probability mass is around the diagonal. Entropy is lowest near the beginning—where most trajectories start similarly—and rises toward the middle as uncertainty about reasoning continuation peaks. As the sequence progresses, entropy decreases again, reflecting convergence toward termination.
  • Figure 3: Prediction Error (MAE) vs. Reasoning Trace Length (binned into 25 groups). MAE increases as the length increases. All predictions were constrained to the interval (0,100).
  • Figure 4: Fraction of invalid predictions (above 100%) across the MATH500, AMC23, and OlympiadMath datasets. With SFT alone, the fraction of invalid predictions approaches 100% at 15K tokens, whereas combining label noise with reinforcement learning almost never produces invalid predictions, even at sequence lengths up to 25K tokens.
  • Figure 5: Fraction of non-monotonic predictions as a function of relative position in the sequence.
  • ...and 6 more figures
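
The quantities plotted in Figures 4 and 5 can be approximated from generated text with a short Python sketch. The tag payload format (a bare number with an optional percent sign), the helper names, and the exact definitions of "invalid" and "non-monotonic" used here are assumptions for illustration; the paper may define them differently.

```python
import re
from typing import List

# Assumed payload: a non-negative number, optionally followed by "%".
PROGRESS_TAG = re.compile(r"<progressbar>\s*(\d+(?:\.\d+)?)\s*%?\s*</progressbar>")


def extract_progress(text: str) -> List[float]:
    """Return every progress estimate found in the text, in order of appearance."""
    return [float(value) for value in PROGRESS_TAG.findall(text)]


def fraction_invalid(preds: List[float], upper: float = 100.0) -> float:
    """Share of estimates that exceed the upper bound (here: above 100%)."""
    if not preds:
        return 0.0
    return sum(p > upper for p in preds) / len(preds)


def fraction_non_monotonic(preds: List[float]) -> float:
    """Share of consecutive pairs in which the estimate decreases."""
    if len(preds) < 2:
        return 0.0
    drops = sum(later < earlier for earlier, later in zip(preds, preds[1:]))
    return drops / (len(preds) - 1)
```

For example, a trace containing the estimates 10, 40, 30, and 120 would have one invalid prediction out of four and one decrease among three consecutive pairs.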