Off-Trajectory Reasoning: Can LLMs Collaborate on Reasoning Trajectory?
Aochong Oliver Li, Tanya Goyal
TL;DR
This work investigates off-trajectory reasoning, the ability of solo-reasoning LLMs to collaborate on a shared reasoning trajectory using inputs from other agents. It formalizes Recoverability and Guidability as twin tests and evaluates 15 open-weight LLMs across five math benchmarks, revealing that stronger solo-reasoners are often more fragile under distractions and that guidability remains minimal. Through post-training control studies on distillation teachers, RL after SFT, and data filtering, the paper demonstrates that teacher vulnerabilities can transfer to students, yet RL can substantially boost recoverability when training saturates. Overall, the findings highlight significant limitations of current reasoning LLMs in collaborative settings and offer a framework for developing natively robust reasoning collaborators for multi-agent problem solving.
Abstract
Reasoning LLMs are trained to verbalize their reasoning process, yielding strong gains on complex tasks. This transparency also opens a promising direction: multiple reasoners can directly collaborate on each other's thinking within a shared trajectory, yielding better inference efficiency and exploration. A key prerequisite, however, is the ability to assess the usefulness of, and build on, another model's partial thinking -- we call this off-trajectory reasoning. Our paper investigates a critical question: can standard solo-reasoning training pipelines deliver the desired off-trajectory behaviors? We propose twin tests that capture the two extremes of the off-trajectory spectrum, namely Recoverability, which tests whether LLMs can backtrack from "distractions" induced by misleading reasoning traces, and Guidability, which tests their ability to build upon correct reasoning from stronger collaborators. Our study evaluates 15 open-weight LLMs (1.5B-32B) and reveals a counterintuitive finding -- "stronger" LLMs on benchmarks are often more fragile under distraction. Moreover, all models tested fail to effectively leverage guiding steps from collaborators on problems beyond their inherent capabilities, with solve rates remaining under 9.2%. Finally, we conduct control studies to isolate the effects of three post-training factors on these behaviors: the choice of distillation teacher, the use of RL, and the data selection strategy. Our results provide actionable insights for training natively strong reasoning collaborators; e.g., we find that suboptimal recoverability behaviors of teacher models are transferred to distilled students even if the distillation trajectories are correct. Taken together, this work lays the groundwork for evaluating multi-model collaborations in shared reasoning trajectories and highlights the limitations of off-the-shelf reasoning LLMs.
