SPRINT: Enabling Interleaved Planning and Parallelized Execution in Reasoning Models

Emil Biju, Shayan Talaei, Zhemin Huang, Mohammadreza Pourreza, Azalia Mirhoseini, Amin Saberi

TL;DR

SPRINT addresses the latency of large reasoning models by interleaving planning with parallel execution at inference time, rather than relying solely on sequential chain-of-thought. It introduces a two-role architecture (a planner and a pool of executors), a data-curation pipeline that converts reasoning traces into rounds of planning and execution, and supervised fine-tuning on ~1.7k such demonstrations. Empirically, it matches or slightly exceeds the accuracy of strong baselines on math reasoning while reducing sequential tokens by up to 39% on long problems (those requiring more than 8,000 output tokens). It also generalizes to out-of-domain tasks (GPQA and Countdown), with up to 45% and 65% fewer sequential tokens, respectively, on longer reasoning trajectories. The work highlights the efficiency gains possible from interleaved planning and parallel execution and outlines hardware and tool-use extensions for future improvements.

Abstract

Large reasoning models (LRMs) excel at complex reasoning tasks but typically generate lengthy sequential chains-of-thought, resulting in long inference times before arriving at the final answer. To address this challenge, we introduce SPRINT, a novel post-training and inference-time framework designed to enable LRMs to dynamically identify and exploit opportunities for parallelization during their reasoning process. SPRINT incorporates an innovative data curation pipeline that reorganizes natural language reasoning trajectories into structured rounds of long-horizon planning and parallel execution. By fine-tuning LRMs on a small amount of such curated data, the models learn to dynamically identify independent subtasks within extended reasoning processes and effectively execute them in parallel. Through extensive evaluations, we demonstrate that models fine-tuned with the SPRINT framework match the performance of reasoning models on complex domains such as mathematics while generating up to 39% fewer sequential tokens on problems requiring more than 8,000 output tokens. Finally, we observe consistent results transferred to two out-of-distribution tasks, namely GPQA and Countdown, with up to 45% and 65% reduction in average sequential tokens respectively for longer reasoning trajectories, while matching the performance of the fine-tuned reasoning model.
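
The notion of "sequential tokens" can be made concrete with a small sketch. It assumes that reasoning steps and their dependencies are already known, groups the steps into stages whose members are mutually independent, and compares a fully serial token count with a count in which each stage costs only as much as its longest step. The step names, token counts, and the per-stage maximum are illustrative assumptions, not the paper's exact accounting or results.

```python
from typing import Dict, List, Tuple


def group_into_stages(deps: Dict[str, List[str]]) -> List[List[str]]:
    """Group steps into stages; every step in a stage depends only on earlier stages.

    `deps` maps each step name to the steps it depends on (hypothetical format).
    """
    remaining = {step: set(parents) for step, parents in deps.items()}
    done: set = set()
    stages: List[List[str]] = []
    while remaining:
        # A step is ready once all of its dependencies have been completed.
        ready = sorted(s for s, parents in remaining.items() if parents <= done)
        if not ready:
            raise ValueError("dependency cycle or unknown dependency")
        stages.append(ready)
        done.update(ready)
        for step in ready:
            del remaining[step]
    return stages


def sequential_tokens(stages: List[List[str]], tokens: Dict[str, int]) -> Tuple[int, int]:
    """Return (serial, parallel) token counts.

    Assumption: steps within a stage run concurrently, so a stage contributes
    only the length of its longest step to the sequential count.
    """
    serial = sum(tokens[s] for stage in stages for s in stage)
    parallel = sum(max(tokens[s] for s in stage) for stage in stages)
    return serial, parallel


# Toy example: b and c both depend on a; d depends on b and c.
deps = {"a": [], "b": ["a"], "c": ["a"], "d": ["b", "c"]}
tokens = {"a": 100, "b": 300, "c": 200, "d": 50}

stages = group_into_stages(deps)          # [['a'], ['b', 'c'], ['d']]
serial, parallel = sequential_tokens(stages, tokens)  # (650, 450)
```

In this toy case the sequential count drops from 650 to 450 because steps `b` and `c` overlap; the total number of generated tokens is unchanged.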

Paper Structure

This paper contains 29 sections, 5 equations, 7 figures, 2 tables.

Figures (7)

  • Figure 1: Overview of SPRINT's inference process: 1) The planner receives the cumulative context, including previous plans and execution results, and either proposes a new set of independent tasks or terminates the process by producing the final answer. 2) A pool of executors concurrently performs the tasks, each according to its own prompt. 3) The execution outcomes are appended to the cumulative context with corresponding tags, and the process returns to step 1 for the next iteration.
  • Figure 2: Overview of the SPRINT training pipeline: (0) Starting from raw reasoning trajectories, (1) we first extract individual reasoning steps, identifying their planning and execution phases. Next, (2) we construct a DAG representing dependencies among these steps, and then (3) group steps into compact stages that can be executed in parallel. Finally, (4) after filtering and reformatting these structured stages into training samples, we perform supervised fine-tuning of a reasoning model to dynamically propose and execute parallelizable tasks.
  • Figure 3: Comparison of sequential tokens decoded during reasoning. Sequential reasoning models generate all the steps serially, resulting in long token sequences. SPRINT's fine-tuning data restructures these steps into stages, grouping parallelizable plans followed by their respective executions. This organization enables SPRINT's inference framework to execute these grouped steps in parallel, significantly reducing the number of sequential tokens.
  • Figure 4: Pareto plot comparing accuracy (%) and sequential token counts of different methods on MATH-500. SPRINT achieves slightly higher accuracy than the RFT model while generating 440 (~15%) fewer sequential tokens on average.
  • Figure 5: Number of problems at each difficulty level in MATH-500 that pass each stage of interleaved planning before arriving at the final answer. The dashed line indicates the number of problems at each stage that exhibit parallelism (more than one plan).
  • ...and 2 more figures
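
The three-step loop in the Figure 1 caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `planner` and `executor` are hypothetical callables standing in for model calls, the tag names are invented for the example (the caption only says outcomes are appended "with corresponding tags"), and a thread pool stands in for the pool of executors.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence, Tuple, Union

# Hypothetical convention: the planner returns a list of task prompts to
# continue, or a plain string (the final answer) to terminate.
PlannerOutput = Union[List[str], str]
Planner = Callable[[Sequence[str]], PlannerOutput]
Executor = Callable[[Sequence[str], str], str]


def planner_executor_loop(
    question: str,
    planner: Planner,
    executor: Executor,
    max_rounds: int = 8,
    max_workers: int = 4,
) -> Tuple[str, List[str]]:
    """Alternate planning and parallel execution until the planner answers."""
    context: List[str] = [f"<question>{question}</question>"]
    for _ in range(max_rounds):
        # Step 1: the planner sees the cumulative context.
        plan = planner(tuple(context))
        if isinstance(plan, str):
            return plan, context  # final answer terminates the loop
        # Step 2: executors run the independent tasks concurrently.
        snapshot = tuple(context)
        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            results = list(pool.map(lambda task: executor(snapshot, task), plan))
        # Step 3: append tagged plans and outcomes, then plan again.
        for i, (task, result) in enumerate(zip(plan, results), start=1):
            context.append(f"<plan id={i}>{task}</plan>")
            context.append(f"<execution id={i}>{result}</execution>")
    raise RuntimeError("planner did not produce a final answer")


# Toy stand-ins for the two model roles (purely illustrative).
def toy_planner(context: Sequence[str]) -> PlannerOutput:
    results = [c for c in context if c.startswith("<execution")]
    if not results:
        return ["square 3", "square 4"]  # two independent subtasks
    values = [int(r.split(">")[1].split("<")[0]) for r in results]
    return f"answer: {sum(values)}"


def toy_executor(context: Sequence[str], task: str) -> str:
    n = int(task.split()[1])
    return str(n * n)


answer, trace = planner_executor_loop("What is 3^2 + 4^2?", toy_planner, toy_executor)
# answer == "answer: 25"
```

Executors receive an immutable snapshot of the context, so results are merged only after every task in the round has finished; this mirrors the round structure in the caption but is a design choice of the sketch.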