Enhancing Long Chain-of-Thought Reasoning through Multi-Path Plan Aggregation
Siheng Xiong, Ali Payani, Faramarz Fekri
TL;DR
This work tackles CoT derailment in inference-time scaling, where long reasoning traces drift off course, especially in smaller LMs. It introduces Multi-Path Plan Aggregation (MPPA), which explores multiple candidate planning steps and aggregates them into a refined plan. The base LM stays fixed as the primary policy, while a LoRA-based plan aggregator learns the planning refinements. To overcome credit assignment challenges on long trajectories, the authors propose online Step-DPO, which uses Twisted Sequential Monte Carlo (TSMC) to provide process-level, stepwise supervision without a separate reward model. Across math, science, and logic benchmarks, MPPA with online Step-DPO outperforms distillation-based long-CoT and outcome-reward RL baselines while using only 10% of the SFT data and 5% of the preference pairs, demonstrating improved robustness and efficiency in long-horizon reasoning.
Abstract
Inference-time scaling enhances the reasoning ability of a language model (LM) by extending its chain-of-thought (CoT). However, existing approaches typically generate the entire reasoning chain in a single pass, which often leads to CoT derailment, i.e., the reasoning trajectory drifting off course due to compounding errors. This problem is particularly severe for smaller LMs, whose limited capacity makes long CoTs harder to keep on track. To address this, we analyze raw long CoTs and uncover a reasoning hierarchy consisting of planning and execution steps. Our analysis reveals that most reasoning errors stem from incorrect planning. Motivated by this observation, we propose Multi-Path Plan Aggregation (MPPA), a framework that augments single-pass reasoning with plan exploration and aggregation. Following a variable interval schedule based on the token position, MPPA generates multiple candidate plans and aggregates them into a refined planning step. To maintain efficiency, we adopt a minimal design in which the base LM serves as the primary policy, while a lightweight LoRA module implements the plan aggregation policy. We further observe that outcome-reward RL is inefficient for long trajectories (e.g., exceeding 4K tokens). To overcome this, we introduce online Step-DPO, a process-level preference optimization scheme that leverages Twisted Sequential Monte Carlo (TSMC) to provide scalable stepwise supervision using small LMs. This yields more efficient training, improved stability, and higher accuracy. Extensive experiments on challenging math, science, and logical reasoning benchmarks demonstrate that, with only 10% of the SFT data and 5% of the preference pairs, our method outperforms both the DeepSeek-R1 distillation baseline and the outcome-reward RL baseline across multiple base models and tasks.
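The generation loop described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`plan_interval`, `mppa_generate`), the callables `propose_plan`, `aggregate`, and `execute`, and the specific schedule (a gap that widens with token position, with made-up constants) are all hypothetical, since the abstract only states that the interval varies with token position. In the paper's design, `propose_plan` and `execute` would correspond to the base LM and `aggregate` to the LoRA plan aggregator; here they are plain callables over token lists.

```python
from typing import Callable, List

Tokens = List[str]


def plan_interval(position: int, base: int = 64, growth: int = 512) -> int:
    """Hypothetical schedule (an assumption, not from the paper):
    the gap between planning steps widens as the token position grows."""
    return base * (1 + position // growth)


def mppa_generate(
    prompt: Tokens,
    propose_plan: Callable[[Tokens, int], Tokens],
    aggregate: Callable[[Tokens, List[Tokens]], Tokens],
    execute: Callable[[Tokens, int], Tokens],
    num_paths: int = 4,
    max_tokens: int = 256,
) -> Tokens:
    """Alternate between an aggregated planning step and an execution chunk."""
    trace = list(prompt)
    generated = 0
    while generated < max_tokens:
        # Plan exploration: sample several candidate plans from the current trace.
        candidates = [propose_plan(trace, k) for k in range(num_paths)]
        # Plan aggregation: merge the candidates into one refined planning step.
        plan = aggregate(trace, candidates)
        trace.extend(plan)
        generated += len(plan)

        # Execution: continue single-pass generation until the next planning point.
        budget = min(plan_interval(generated), max_tokens - generated)
        if budget <= 0:
            break
        chunk = execute(trace, budget)
        if not chunk:  # the executor signalled the end of the reasoning trace
            break
        trace.extend(chunk)
        generated += len(chunk)
    return trace
```

With real models, each callable would wrap a sampling call; keeping them as arguments makes the control flow testable with toy stand-ins and leaves the choice of schedule and aggregation strategy open.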
