Off-Trajectory Reasoning: Can LLMs Collaborate on Reasoning Trajectory?

Aochong Oliver Li, Tanya Goyal

TL;DR

This work investigates off-trajectory reasoning, the ability of solo-reasoning LLMs to collaborate on a shared reasoning trajectory using inputs from other agents. It formalizes Recoverability and Guidability as twin tests and evaluates 15 open-weight LLMs across five math benchmarks, revealing that stronger solo-reasoners are often more fragile under distractions and that guidability remains minimal. Through post-training control studies on distillation teachers, RL after SFT, and data filtering, the paper demonstrates that teacher vulnerabilities can transfer to students, yet RL can substantially boost recoverability when training saturates. Overall, the findings highlight significant limitations of current reasoning LLMs in collaborative settings and offer a framework for developing natively robust reasoning collaborators for multi-agent problem solving.

Abstract

Reasoning LLMs are trained to verbalize their reasoning process, yielding strong gains on complex tasks. This transparency also opens a promising direction: multiple reasoners can directly collaborate on each other's thinking within a shared trajectory, yielding better inference efficiency and exploration. A key prerequisite, however, is the ability to assess the usefulness of, and build on, another model's partial thinking -- we call this off-trajectory reasoning. Our paper investigates a critical question: can standard solo-reasoning training pipelines deliver the desired off-trajectory behaviors? We propose twin tests that capture the two extremes of the off-trajectory spectrum, namely Recoverability, which tests whether LLMs can backtrack from "distractions" induced by misleading reasoning traces, and Guidability, which tests their ability to build upon correct reasoning from stronger collaborators. Our study evaluates 15 open-weight LLMs (1.5B-32B) and reveals a counterintuitive finding -- LLMs that are "stronger" on benchmarks are often more fragile under distraction. Moreover, all models tested fail to effectively leverage guiding steps from collaborators on problems beyond their inherent capabilities, with solve rates remaining under 9.2%. Finally, we conduct control studies to isolate the effects of three post-training factors on these behaviors: the choice of distillation teacher, the use of RL, and the data selection strategy. Our results provide actionable insights for training natively strong reasoning collaborators; e.g., we find that suboptimal recoverability behaviors of teacher models are transferred to distilled students even if the distillation trajectories are correct. Taken together, this work lays the groundwork for evaluating multi-model collaborations in shared reasoning trajectories and highlights the limitations of off-the-shelf reasoning LLMs.

Paper Structure

This paper contains 19 sections, 1 equation, 8 figures, 12 tables.

Figures (8)

  • Figure 1: Comparison of solo (left) vs. collaborative reasoning (right) setting. LLMs of different sizes and functionalities collaborate on a shared trajectory.
  • Figure 2: Illustration of the twin tests: we perturb a model's reasoning trajectories with off-trajectory steers to evaluate its recoverability (under a distracting steer) or guidability (under a guiding steer). The distracting steer is sampled from the same reasoner but for a different question.
  • Figure 3: 15 open-weight LLMs grouped into four families. The branches indicate the source from which LLMs are derived, and the colors indicate SFT/RL training methods.
  • Figure 4: Recoverability (shared) across positions (%) of the original trajectory for 15 LLMs
  • Figure 5: Qwen2.5 models (1.5B and 3B) distilled from AM(-Thinking)-32B show consistently lower recoverability than those distilled from QwQ-32B or Qwen3-32B, while having similar performance on benchmark and guidability; the gap is significant after step 300 ($p \leq 0.005$). Stars mark each model's peak over training steps.
  • ...and 3 more figures
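
The twin-test setup described in the Figure 2 caption (perturbing a model's own trajectory with an off-trajectory steer) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the names `PerturbedPrompt`, `splice_steer`, and `solve_rate`, the step-list representation of a trajectory, and the fractional `position` argument are all hypothetical and chosen only for illustration. The sketch assumes a trajectory is a list of reasoning steps, that the steer is inserted after a given fraction of the original steps, and that the model would then be asked to continue from the rendered text.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class PerturbedPrompt:
    """Hypothetical container for one twin-test input."""
    question: str
    prefix_steps: List[str]  # the model's own steps kept before the steer
    steer_steps: List[str]   # distracting or guiding steps from another source

    def render(self) -> str:
        # Text the evaluated model would be asked to continue from.
        return "\n".join([self.question, *self.prefix_steps, *self.steer_steps])


def splice_steer(question: str,
                 trajectory: Sequence[str],
                 steer: Sequence[str],
                 position: float) -> PerturbedPrompt:
    """Keep the first `position` fraction of `trajectory`, then append `steer`.

    Assumption: `position` is a fraction in [0, 1] of the original trajectory,
    loosely mirroring the "positions (%)" axis named in the Figure 4 caption.
    """
    if not 0.0 <= position <= 1.0:
        raise ValueError("position must be in [0, 1]")
    cut = int(len(trajectory) * position)
    return PerturbedPrompt(question, list(trajectory[:cut]), list(steer))


def solve_rate(outcomes: Sequence[bool]) -> float:
    """Fraction of perturbed runs that still reach the correct answer."""
    if not outcomes:
        raise ValueError("outcomes must be non-empty")
    return sum(outcomes) / len(outcomes)


if __name__ == "__main__":
    own_steps = ["step 1", "step 2", "step 3", "step 4"]
    # Per the Figure 2 caption, a distracting steer comes from the same
    # reasoner but for a different question; here it is a placeholder string.
    distraction = ["steps sampled for an unrelated question"]
    prompt = splice_steer("Question text", own_steps, distraction, position=0.5)
    print(prompt.render())
    print(solve_rate([True, False, True, True]))
```

Under these assumptions, a recoverability-style score would be `solve_rate` over runs with a distracting steer, and a guidability-style score would be `solve_rate` over runs with a guiding steer on problems the model could not solve alone; the exact metric definitions are given in the paper, not here.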