Beyond Rejection Sampling: Trajectory Fusion for Scaling Mathematical Reasoning

Jie Deng, Hanshuang Tong, Jun Li, Shining Liang, Ning Wu, Hongzhi Li, Yutao Xie

TL;DR

TrajFusion reframes rejection sampling fine-tuning by adaptively fusing selected incorrect trajectories with correct solutions to create fused training samples that simulate trial-and-error reasoning. It uses reflection prompts and diversity-aware selection to harness informative failure modes, improving data efficiency and performance across multiple LLM backbones and long-context benchmarks. Experiments show consistent gains over vanilla RFT on diverse math datasets, including long-form tasks and 32K-context scenarios, while preserving a simple training pipeline without architectural changes. The approach highlights the value of structured negative signals in mathematical reasoning, enabling scalable improvements without extra sampling budgets or model alterations.

Abstract

Large language models (LLMs) have made impressive strides in mathematical reasoning, often fine-tuned using rejection sampling that retains only correct reasoning trajectories. While effective, this paradigm treats supervision as a binary filter that systematically excludes teacher-generated errors, leaving a gap in how reasoning failures are modeled during training. In this paper, we propose TrajFusion, a fine-tuning strategy that reframes rejection sampling as a structured supervision construction process. Specifically, TrajFusion forms fused trajectories that explicitly model trial-and-error reasoning by interleaving selected incorrect trajectories with reflection prompts and correct trajectories. The length of each fused sample is adaptively controlled based on the frequency and diversity of teacher errors, providing richer supervision for challenging problems while safely reducing to vanilla rejection sampling fine-tuning (RFT) when error signals are uninformative. TrajFusion requires no changes to the architecture or training objective. Extensive experiments across multiple math benchmarks demonstrate that TrajFusion consistently outperforms RFT, particularly on challenging and long-form reasoning problems.
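The fusion step described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `build_fused_sample`, the reflection prompt wording, the `min_error_rate` threshold, and the "one trajectory per distinct wrong answer, most frequent first" selection rule are all hypothetical stand-ins for the paper's adaptive, diversity-aware procedure.

```python
from collections import Counter

# Hypothetical wording; the source does not give the actual reflection prompt.
REFLECTION_PROMPT = "Wait, this looks wrong. Let me try again."


def build_fused_sample(samples, max_errors=2, min_error_rate=0.25):
    """Build one training target from teacher samples for a single problem.

    samples: list of (trajectory_text, final_answer, is_correct) tuples.
    Returns None when no verified-correct trajectory exists.
    """
    correct = [text for text, _, ok in samples if ok]
    incorrect = [(text, ans) for text, ans, ok in samples if not ok]
    if not correct:
        return None

    error_rate = len(incorrect) / len(samples)
    if error_rate < min_error_rate:
        # Assumed fallback: errors are rare, so use the plain RFT target.
        return correct[0]

    # Assumed diversity rule: keep the first trajectory seen for each distinct
    # wrong answer, ranking wrong answers by how often the teacher produced them.
    first_text = {}
    for text, ans in incorrect:
        first_text.setdefault(ans, text)
    ranked = [ans for ans, _ in Counter(ans for _, ans in incorrect).most_common()]
    chosen = [first_text[ans] for ans in ranked[:max_errors]]

    # Interleave: incorrect attempt, reflection prompt, ..., correct solution.
    parts = []
    for text in chosen:
        parts.extend([text, REFLECTION_PROMPT])
    parts.append(correct[0])
    return "\n".join(parts)
```

Because the output is an ordinary text target, it can be fed to a standard supervised fine-tuning loop, which is consistent with the abstract's claim that no change to the architecture or training objective is needed.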

Paper Structure

This paper contains 55 sections, 40 equations, 4 figures, 6 tables.

Figures (4)

  • Figure 1: Sampling statistics for Qwen2.5-Math-7B-Instruct on the DeepMath dataset. Left: distribution of error rates across problems. Right: diversity of incorrect final answers measured by Shannon entropy.
  • Figure 2: Comparison between vanilla rejection-sampling fine-tuning (RFT) and our TrajFusion framework. Top: Vanilla RFT samples multiple responses from a teacher model and retains only verified correct trajectories ($\mathcal{Y}^+$) for training, discarding all incorrect ones. Bottom: TrajFusion explicitly separates correct ($\mathcal{Y}^+$) and incorrect ($\mathcal{Y}^-$) trajectories, performs problem-level error analysis, and constructs fused training trajectories by selectively integrating informative incorrect reasoning paths with corrected trajectories.
  • Figure 3: Pass@1 accuracy on MATH during the first training epoch, evaluated at fixed steps. TrajFusion (red) consistently outperforms Vanilla RFT (gray) across both models. The shaded areas highlight the performance margin, with a significant gap established early in training.
  • Figure 4: Average output length and reflection frequency on GSM8K and MATH. (a, b) TrajFusion initially results in longer generations, but the average length gradually decreases as training progresses. (c, d) The frequency of reflection tokens shows an upward trend, particularly on the more complex MATH dataset.
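The two per-problem statistics named in the Figure 1 caption, error rate and Shannon entropy of incorrect final answers, can be computed as in the sketch below. The function name `error_stats`, the exact-string comparison against a reference answer, and the choice of base-2 logarithm are assumptions made for illustration; the source does not specify how answers are matched or which log base is used.

```python
import math
from collections import Counter


def error_stats(final_answers, reference):
    """Return (error_rate, entropy_bits) for one problem's sampled answers.

    final_answers: final answers extracted from the teacher's samples.
    reference: the ground-truth answer (compared by exact string match here).
    """
    wrong = [a for a in final_answers if a != reference]
    error_rate = len(wrong) / len(final_answers)

    # Shannon entropy over the empirical distribution of wrong answers.
    # With no wrong answers, or a single distinct one, the entropy is zero.
    entropy = 0.0
    for count in Counter(wrong).values():
        p = count / len(wrong)
        entropy -= p * math.log2(p)
    return error_rate, entropy
```

A problem where the teacher repeats the same mistake has a high error rate but zero entropy, while a problem with many different wrong answers scores high on both, which is the distinction the two panels of Figure 1 draw.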