Temporal Sampling for Forgotten Reasoning in LLMs

Yuetai Li, Zhangchen Xu, Fengqing Jiang, Bhaskar Ramasubramanian, Luyao Niu, Bill Yuchen Lin, Xiang Yue, Radha Poovendran

TL;DR

The paper identifies Temporal Forgetting, where intermediate training checkpoints solve problems that the final model no longer solves. It introduces Temporal Sampling, a decode-time strategy that draws outputs from multiple training checkpoints to recover forgotten reasoning without retraining or ensembling. Across multiple benchmarks and training setups, Temporal Sampling yields 4–19 point gains in Pass@k and strengthens inference-time scaling metrics, with LoRA-adapted variants offering storage-efficient deployment. The work highlights that true model competence lies in training dynamics rather than a single parameter snapshot, suggesting new directions for evaluation and deployment of LLMs.

Abstract

Fine-tuning large language models (LLMs) is intended to improve their reasoning capabilities, yet we uncover a counterintuitive effect: models often forget how to solve problems they previously answered correctly during training. We term this phenomenon temporal forgetting and show that it is widespread across model sizes, fine-tuning methods (both Reinforcement Learning and Supervised Fine-Tuning), and multiple reasoning benchmarks. To address this gap, we introduce Temporal Sampling, a simple decoding strategy that draws outputs from multiple checkpoints along the training trajectory. This approach recovers forgotten solutions without retraining or ensembling and leads to substantial improvements in reasoning performance, with gains of 4 to 19 points in Pass@k and consistent gains in Majority@k across several benchmarks. We further extend our method to LoRA-adapted models, demonstrating that storing only adapter weights across checkpoints achieves similar benefits with minimal storage cost. By leveraging the temporal diversity inherent in training, Temporal Sampling offers a practical, compute-efficient way to surface hidden reasoning ability and rethink how we evaluate LLMs.
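
The decoding strategy described in the abstract can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: each checkpoint is modeled as a hypothetical callable that maps a prompt to an answer string, samples are assigned round-robin (one plausible way to spread $k$ samples evenly over $t$ checkpoints), and `temporal_sample` and `majority_answer` are names invented for this example.

```python
from collections import Counter
from typing import Callable, Sequence

# Hypothetical interface: a checkpoint is any callable prompt -> answer string.
Checkpoint = Callable[[str], str]

def temporal_sample(prompt: str, checkpoints: Sequence[Checkpoint], k: int) -> list[str]:
    """Draw k answers, cycling through the checkpoints round-robin.

    With t checkpoints, sample s is taken from checkpoint s % t, so each
    checkpoint contributes either floor(k/t) or ceil(k/t) samples.
    """
    t = len(checkpoints)
    return [checkpoints[s % t](prompt) for s in range(k)]

def majority_answer(answers: Sequence[str]) -> str:
    """Return the most frequent answer (a Majority@k-style vote)."""
    return Counter(answers).most_common(1)[0][0]

# Toy stand-ins for three saved checkpoints (assumed, for illustration only).
toy_checkpoints = [lambda p: "42", lambda p: "41", lambda p: "42"]
answers = temporal_sample("What is 6 * 7?", toy_checkpoints, k=5)
voted = majority_answer(answers)
```

In a real setup each callable would wrap a model (or a base model plus a LoRA adapter) loaded from a saved checkpoint; the sampling and voting logic stays the same.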

Paper Structure

This paper contains 32 sections, 1 theorem, 16 equations, 13 figures, 8 tables.

Key Result

Theorem 1

Denote by $r_{i,j}$ the Pass@1 rate of the $j$-th checkpoint on problem $i$, and by $C_{i,j}$ the number of correct samples among $N$ candidates for problem $i$ from checkpoint $j$. Let Pass@$k|t$ denote the probability of obtaining at least one correct answer when $k$ samples are drawn from $t$ checkpoints for problem $i$, where $k_j$ is determined by the balanced integer partition of $k$
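
The theorem statement is cut off here, so the following is only a sketch of how a Pass@$k|t$ estimate could be computed from the quantities it defines. It assumes (this is not confirmed by the text above) that the estimator applies the standard unbiased Pass@k form $1 - \binom{N-C}{k}/\binom{N}{k}$ to each checkpoint with its own share $k_j$ and multiplies the per-checkpoint failure probabilities, treating checkpoints as independent. Which checkpoints receive the extra sample when $t$ does not divide $k$ is also an assumption. The function names are invented for this example.

```python
from math import comb

def balanced_partition(k: int, t: int) -> list[int]:
    """Split k into t non-negative integers that differ by at most one.

    Assumption: the first (k mod t) checkpoints receive the extra sample.
    """
    base, extra = divmod(k, t)
    return [base + 1 if j < extra else base for j in range(t)]

def pass_at_k_given_t(correct_counts: list[int], n: int, k: int) -> float:
    """Estimate Pass@k|t for one problem.

    correct_counts[j] plays the role of C_{i,j}: correct samples among the
    n candidates generated by checkpoint j. Requires k_j <= n for every j.
    """
    shares = balanced_partition(k, len(correct_counts))
    p_all_wrong = 1.0
    for c, k_j in zip(correct_counts, shares):
        # Probability that k_j draws without replacement from n candidates
        # are all incorrect; comb(n - c, k_j) is 0 when k_j > n - c.
        p_all_wrong *= comb(n - c, k_j) / comb(n, k_j)
    return 1.0 - p_all_wrong
```

For example, `balanced_partition(64, 8)` gives eight samples per checkpoint, and with a single checkpoint (`t = 1`) the function reduces to the usual Pass@k estimator on that checkpoint.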

Figures (13)

  • Figure 1: (a) During RL training of the Deepseek-R1-1.5B model, 76.7% of AIME problems were solved correctly at some intermediate checkpoint, yet only 30% remained correct in the final model. We term this phenomenon Temporal Forgetting. (b) We propose Temporal Sampling, which uses training dynamics as a source of answer diversity by distributing inference samples across multiple distinct checkpoints from the training trajectory, rather than relying solely on the final checkpoint.
  • Figure 2: Fine-tuned models like DeepscaleR-1.5B [deepscaler2025] and OpenR1-7B [openr1] outperform the base model overall but also forget many questions the base model answered correctly.
  • Figure 3: Forgetting dynamics of Qwen2.5-7B during RL training. (a) Answer correctness trajectories for OlympiadBench questions across training checkpoints, illustrating that solutions oscillate between correct and incorrect states. "Forget" means that an answer was correct at the previous checkpoint but is incorrect at the current one. Conversely, "Improve" means that an answer was incorrect at the previous checkpoint but is correct at the current one. (b) Percentage of questions per benchmark that are ever forgotten or ever correct at some checkpoint during GRPO.
  • Figure 4: Pass rate distribution across training checkpoints on AIME24. Individual problems show varying pass rates over time. Temporal Sampling exploits these dynamics to improve answer diversity at inference.
  • Figure 5: Pass$@k$ for different numbers of checkpoints $t$ on the AIME2024, AMC, and AIME2025 benchmarks when using Temporal Sampling. The case $t=1$ represents the baseline of standard Pass$@k$ sampling on the final checkpoint. Our proposed Temporal Sampling for Qwen2.5-7B with $t=8$ outperforms the baseline by more than 19, 13, and 4 percentage points on AIME2024, AMC, and AIME2025, respectively, when sampling 64 responses.
  • ...and 8 more figures
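
The quantities reported in the captions of Figures 1 and 3 ("Forget" and "Improve" transitions, the share of questions ever correct or ever forgotten, and final-checkpoint accuracy) can be computed from a per-question correctness matrix. The sketch below assumes a hypothetical layout in which `correct[i][j]` is `True` when question `i` is answered correctly at checkpoint `j`; the function names and the toy data are invented for illustration and are not results from the paper.

```python
def transition_labels(trajectory: list[bool]) -> list[str]:
    """Label each consecutive checkpoint pair as 'forget', 'improve', or 'same'."""
    labels = []
    for prev, cur in zip(trajectory, trajectory[1:]):
        if prev and not cur:
            labels.append("forget")    # correct before, incorrect now
        elif not prev and cur:
            labels.append("improve")   # incorrect before, correct now
        else:
            labels.append("same")
    return labels

def forgetting_summary(correct: list[list[bool]]) -> dict[str, float]:
    """Fractions of questions by forgetting behaviour across checkpoints."""
    n = len(correct)
    return {
        # correct at some checkpoint
        "ever_correct": sum(any(row) for row in correct) / n,
        # at least one correct -> incorrect transition
        "ever_forgotten": sum("forget" in transition_labels(row) for row in correct) / n,
        # correct at the last checkpoint
        "final_correct": sum(row[-1] for row in correct) / n,
        # solved at some checkpoint but wrong at the final one
        "solved_then_lost": sum(any(row) and not row[-1] for row in correct) / n,
    }

# Toy data: 4 questions, 3 checkpoints (illustrative only).
toy = [
    [True, False, True],
    [False, True, False],
    [False, False, False],
    [True, True, True],
]
summary = forgetting_summary(toy)
```

The gap between `ever_correct` and `final_correct` is the kind of difference Figure 1(a) reports (76.7% versus 30% on AIME).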

Theorems & Definitions (2)

  • Theorem 1
  • proof