Enhancing Long Chain-of-Thought Reasoning through Multi-Path Plan Aggregation

Siheng Xiong, Ali Payani, Faramarz Fekri

TL;DR

This work tackles CoT derailment in inference-time scaling for long reasoning traces, a problem that is especially severe in smaller LMs. It introduces Multi-Path Plan Aggregation (MPPA), which explores multiple candidate planning steps and aggregates them into a refined plan, paired with an architecture that keeps the base LM fixed while a LoRA-based plan aggregator learns planning refinements. To overcome credit assignment challenges on long trajectories, the authors propose online Step-DPO, which uses Twisted Sequential Monte Carlo to provide process-level, stepwise supervision without a separate reward model. Across math, science, and logic benchmarks, MPPA with online Step-DPO outperforms distillation-based long-CoT and outcome-based RL baselines with only 10% of the SFT data and 5% of the preference pairs, demonstrating improved robustness and efficiency in long-horizon reasoning.

Abstract

Inference-time scaling enhances the reasoning ability of a language model (LM) by extending its chain-of-thought (CoT). However, existing approaches typically generate the entire reasoning chain in a single forward pass, which often leads to CoT derailment, i.e., the reasoning trajectory drifting off course due to compounding errors. This problem is particularly severe for smaller LMs with long CoTs due to their limited capacity. To address this, we analyze raw long CoTs and uncover a reasoning hierarchy consisting of planning and execution steps. Our analysis reveals that most reasoning errors stem from incorrect planning. Motivated by this observation, we propose Multi-Path Plan Aggregation (MPPA), a framework that augments single-pass reasoning with plan exploration and aggregation. Following a variable interval schedule based on the token position, MPPA generates multiple candidate plans and aggregates them into a refined planning step. To maintain efficiency, we adopt a minimal design in which the base LM serves as the primary policy, while a lightweight LoRA module implements the plan aggregation policy. We further observe that outcome-reward RL is inefficient for long trajectories (e.g., exceeding 4K tokens). To overcome this, we introduce online Step-DPO, a process-level preference optimization scheme that leverages Twisted Sequential Monte Carlo (TSMC) to provide scalable stepwise supervision using small LMs. This yields more efficient training, improved stability, and higher accuracy. Extensive experiments on challenging math, science, and logical reasoning benchmarks demonstrate that, with only 10% SFT data and 5% of preference pairs, our method outperforms both the DeepSeek-R1 distillation baseline and the outcome-reward RL baseline across multiple base models and tasks.
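The control flow described in the abstract can be illustrated with a minimal sketch. Everything named here is hypothetical: the geometric schedule in `planning_positions`, the callables `propose_plan`, `aggregate_plans`, and `execute`, and all default values are assumptions made for illustration, not the authors' implementation. In the paper's design the base LM would play the role of `propose_plan` and `execute`, and the LoRA module that of `aggregate_plans`.

```python
from typing import Callable, List


def planning_positions(max_tokens: int,
                       first_interval: int = 64,
                       growth: float = 1.5) -> List[int]:
    """Token positions at which a planning step is triggered.

    Assumption: the gap between planning steps grows geometrically with
    token position. The source only says the interval is variable.
    """
    positions: List[int] = []
    pos, interval = 0, float(first_interval)
    while pos < max_tokens:
        positions.append(pos)
        pos += int(interval)
        interval *= growth
    return positions


def mppa_generate(prompt: str,
                  propose_plan: Callable[[str], str],
                  aggregate_plans: Callable[[str, List[str]], str],
                  execute: Callable[[str, int], str],
                  max_tokens: int = 512,
                  num_paths: int = 4) -> str:
    """Alternate aggregated planning steps with execution segments."""
    trace: List[str] = []
    positions = planning_positions(max_tokens)
    for i, start in enumerate(positions):
        end = positions[i + 1] if i + 1 < len(positions) else max_tokens
        context = prompt + "".join(trace)
        # Explore: sample several candidate plans from the same context.
        candidates = [propose_plan(context) for _ in range(num_paths)]
        # Aggregate: merge the candidates into one refined planning step.
        trace.append(aggregate_plans(context, candidates))
        # Execute: continue single-path reasoning until the next planning step.
        trace.append(execute(prompt + "".join(trace), end - start))
    return "".join(trace)


if __name__ == "__main__":
    # Stub policies so the sketch runs without a model.
    out = mppa_generate(
        "Q: ...",
        propose_plan=lambda ctx: "P",
        aggregate_plans=lambda ctx, cands: f"[plan:{len(cands)}]",
        execute=lambda ctx, budget: f"<exec:{budget}>",
    )
    print(out)
```

With real models, only the three callables would change; the loop itself stays the same, which is the sense in which plan aggregation augments rather than replaces single-pass decoding.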

Paper Structure

This paper contains 30 sections, 4 equations, 8 figures, 2 tables, 2 algorithms.

Figures (8)

  • Figure 1: The Chain-of-Thought trajectories can be lengthy and the positions of the first error vary considerably, making outcome-based RL fine-tuning inefficient. The statistics in (b) are based on 100 trajectories with incorrect final answers, where the first error was manually identified.
  • Figure 2: Training long trajectories with outcome rewards is highly inefficient. Both latency and memory usage increase rapidly with trajectory length, and outcome-based supervision requires the entire trajectory to complete before providing feedback. All measurements are obtained using Verl (Sheng et al., 2024) with vLLM (Kwon et al., 2023) on a single A100 node.
  • Figure 3: Illustration of multi-path plan aggregation (MPPA) in long CoTs. The green boxes represent planning steps, where the model generates and aggregates multiple candidate plans before selecting or refining the best one. The white boxes represent execution steps, which follow a fixed reasoning path based on the chosen plan. This hierarchical reasoning process allows the model to explore, revise, and aggregate diverse plans to improve final reasoning accuracy.
  • Figure 4: Monte Carlo rollout analysis on MATH500. Steps are segmented by \n\n in the hidden CoTs. (a) Estimation accuracy is comparable between DeepSeek-R1-Distill-LLaMA-8B and DeepSeek-R1-Distill-Qwen-1.5B. (b) Runtime grows with model size, with DeepSeek-R1-Distill-LLaMA-8B substantially slower than DeepSeek-R1-Distill-Qwen-1.5B. All measurements are based on a single A100 using vLLM.
  • Figure 5: Accuracy of preference judgment on LLaMA-3.1-8B with MPPA and baselines.
  • ...and 3 more figures
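
The step segmentation and Monte Carlo rollout estimate mentioned in the Figure 4 caption can be sketched as follows. This is a generic rollout estimator written under stated assumptions, not the paper's TSMC procedure: `rollout` and `is_correct` are hypothetical callables standing in for a small LM sampler and an answer checker, and `num_rollouts` is an arbitrary default.

```python
from typing import Callable, List


def split_steps(cot: str) -> List[str]:
    """Segment a chain-of-thought into steps on blank lines ("\n\n")."""
    return [s.strip() for s in cot.split("\n\n") if s.strip()]


def rollout_value(prefix_steps: List[str],
                  rollout: Callable[[str], str],
                  is_correct: Callable[[str], bool],
                  num_rollouts: int = 8) -> float:
    """Estimate how promising a partial trajectory is.

    Returns the fraction of sampled completions of the prefix that end in
    a correct final answer.
    """
    prefix = "\n\n".join(prefix_steps)
    hits = sum(1 for _ in range(num_rollouts) if is_correct(rollout(prefix)))
    return hits / num_rollouts


if __name__ == "__main__":
    steps = split_steps("Plan: factor.\n\nx^2 - 1 = (x - 1)(x + 1)\n\nAnswer: 1")
    # Deterministic stub sampler: always completes with the same answer.
    value = rollout_value(
        steps[:2],
        rollout=lambda p: p + "\n\nAnswer: 1",
        is_correct=lambda text: text.endswith("Answer: 1"),
    )
    print(len(steps), value)
```

Comparing such estimates for two candidate continuations of the same prefix is one way to obtain a stepwise preference signal; how the paper turns rollout estimates into preference pairs is not specified in the captions shown here.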