SPRINT: Enabling Interleaved Planning and Parallelized Execution in Reasoning Models
Emil Biju, Shayan Talaei, Zhemin Huang, Mohammadreza Pourreza, Azalia Mirhoseini, Amin Saberi
TL;DR
SPRINT addresses the latency of large reasoning models by interleaving planning with parallel execution during inference, rather than relying solely on a sequential chain-of-thought. It introduces a two-role architecture (a planner and executors), a data-curation pipeline that converts reasoning traces into rounds of planning and parallel execution, and supervised fine-tuning on ~1.7k such demonstrations. Empirically, it matches or slightly exceeds the accuracy of strong baselines on math reasoning while generating up to 39% fewer sequential tokens on problems requiring more than 8,000 output tokens. It also generalizes to out-of-domain tasks (GPQA and Countdown), with up to 45% and 65% fewer average sequential tokens, respectively, on longer reasoning trajectories. The work highlights the efficiency gains possible from interleaved planning and parallel execution and outlines hardware and tool-use extensions as future work.
Abstract
Large reasoning models (LRMs) excel at complex reasoning tasks but typically generate lengthy sequential chains-of-thought, resulting in long inference times before arriving at the final answer. To address this challenge, we introduce SPRINT, a novel post-training and inference-time framework designed to enable LRMs to dynamically identify and exploit opportunities for parallelization during their reasoning process. SPRINT incorporates a data curation pipeline that reorganizes natural language reasoning trajectories into structured rounds of long-horizon planning and parallel execution. By fine-tuning LRMs on a small amount of such curated data, the models learn to dynamically identify independent subtasks within extended reasoning processes and execute them in parallel. Through extensive evaluations, we demonstrate that models fine-tuned with the SPRINT framework match the performance of reasoning models on complex domains such as mathematics while generating up to 39% fewer sequential tokens on problems requiring more than 8,000 output tokens. Finally, we observe that these results transfer to two out-of-distribution tasks, GPQA and Countdown, with reductions of up to 45% and 65% in average sequential tokens, respectively, on longer reasoning trajectories, while matching the performance of the fine-tuned reasoning model.
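
To make the planner/executor rounds and the notion of "sequential tokens" concrete, the following is a minimal sketch and not the authors' implementation. It assumes a hypothetical planner callable that returns either a list of independent subtasks or a final answer, and a hypothetical executor callable that solves one subtask. It also assumes that the sequential token count of a round is the planner's tokens plus the tokens of the longest executor in that round, since executors run concurrently. The names PlanStep, ExecResult, run_rounds, toy_planner, toy_executor, and the toy token costs are all illustrative.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass
class PlanStep:
    """Output of one hypothetical planner call."""
    subtasks: List[str] = field(default_factory=list)  # independent prompts for executors
    final_answer: Optional[str] = None                  # set when the planner decides to stop
    planner_tokens: int = 0                             # tokens the planner generated this round


@dataclass
class ExecResult:
    """Output of one hypothetical executor call."""
    text: str
    tokens: int


def run_rounds(
    question: str,
    planner: Callable[[List[str]], PlanStep],
    executor: Callable[[str], ExecResult],
    max_rounds: int = 8,
) -> Tuple[Optional[str], int, int]:
    """Alternate planning and parallel execution until the planner answers.

    Returns (answer, sequential_tokens, total_tokens). The accounting rule
    (planner tokens + longest executor per round) is an assumption.
    """
    context = [question]
    sequential_tokens = 0
    total_tokens = 0
    for _ in range(max_rounds):
        step = planner(context)
        sequential_tokens += step.planner_tokens
        total_tokens += step.planner_tokens
        if step.final_answer is not None:
            return step.final_answer, sequential_tokens, total_tokens
        if not step.subtasks:
            break  # nothing to execute and no answer: stop early
        # Executors are independent, so they can run concurrently.
        with ThreadPoolExecutor(max_workers=len(step.subtasks)) as pool:
            results = list(pool.map(executor, step.subtasks))
        # Only the slowest executor sits on the critical path.
        sequential_tokens += max(r.tokens for r in results)
        total_tokens += sum(r.tokens for r in results)
        context.extend(r.text for r in results)
    return None, sequential_tokens, total_tokens


# Toy stand-ins for the fine-tuned model; token costs are made up.
TOY_COST = {"3*4": 120, "5*6": 300}


def toy_planner(context: List[str]) -> PlanStep:
    if len(context) == 1:  # first round: split into independent products
        return PlanStep(subtasks=["3*4", "5*6"], planner_tokens=50)
    partials = [int(line.split("=")[1]) for line in context[1:]]
    return PlanStep(final_answer=str(sum(partials)), planner_tokens=20)


def toy_executor(subtask: str) -> ExecResult:
    a, b = (int(x) for x in subtask.split("*"))
    return ExecResult(text=f"{subtask}={a * b}", tokens=TOY_COST[subtask])
```

Calling run_rounds("Compute 3*4 + 5*6", toy_planner, toy_executor) in this toy setting yields 370 sequential tokens against 490 total tokens: the shorter executor (120 tokens) overlaps with the longer one (300 tokens) and so does not add to latency. Under the stated assumption, this overlap is the source of the sequential-token savings the abstract reports.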
