ThreadWeaver: Adaptive Threading for Efficient Parallel Reasoning in Language Models

Long Lian, Sida Wang, Felix Juefei-Xu, Tsu-Jui Fu, Xiuyu Li, Adam Yala, Trevor Darrell, Alane Suhr, Yuandong Tian, Xi Victoria Lin

TL;DR

ThreadWeaver addresses the latency bottleneck of large language models during complex reasoning by introducing an adaptive parallel reasoning framework that can operate with standard autoregressive inference engines. It combines a two-stage data generation pipeline, a trie-based training/inference co-design, and a parallelization-aware RL objective (P-GRPO) to learn when to spawn parallel threads and how to balance accuracy with speed. Empirically, it matches or slightly improves on sequential baselines across six math benchmarks while achieving up to $1.53\times$ average speedup in token latency, establishing a Pareto frontier between accuracy and efficiency. The approach is designed to be deployment-friendly: it requires no changes to the underlying inference engine.

Abstract

Scaling inference-time computation has enabled Large Language Models (LLMs) to achieve strong reasoning performance, but inherently sequential decoding leads to substantial latency, especially on complex tasks. Recent work on adaptive parallel reasoning aims to improve inference efficiency by decomposing the problem-solving process into concurrent reasoning threads when beneficial. However, existing methods on realistic tasks are either limited to supervised behavior cloning or exhibit significant accuracy drops compared to widely-used sequential long chain-of-thought (CoT) baselines. Moreover, many require customized inference engines, complicating deployment. We introduce ThreadWeaver, a framework for adaptive parallel reasoning that achieves accuracy on par with popular sequential reasoning models of comparable size while significantly reducing inference latency. ThreadWeaver's performance stems from three key innovations: 1) a two-stage parallel trajectory generator that produces large-scale, high-quality CoT data with parallel annotations for supervised fine-tuning; 2) a trie-based training-inference co-design that enables parallel reasoning on any off-the-shelf autoregressive inference engine without modifying position embeddings or KV caches; and 3) a parallelization-aware reinforcement learning framework that teaches the model to balance accuracy with effective parallelization. Across six challenging mathematical reasoning benchmarks, ThreadWeaver trained atop Qwen3-8B achieves accuracy comparable to cutting-edge sequential reasoning models (71.9% on average and 79.9% on AIME24) while delivering up to 1.53x average speedup in token latency, establishing a new Pareto frontier between accuracy and efficiency.

Paper Structure

This paper contains 76 sections, 32 equations, 7 figures, 11 tables, 1 algorithm.

Figures (7)

  • Figure 1: Left: Sequential reasoning solves the problem step by step iteratively, so its reasoning latency grows proportionally to the length of the reasoning chain and cannot be reduced by allocating more compute. ThreadWeaver instead creates concurrent reasoning threads adaptively that tackle different parts of the solution through spawn and join operations, effectively shortening the critical path when additional compute is available. Right: Per-problem speedup histograms on AIME24 and MATH500 for ThreadWeaver showing up to $3\times$ per-problem speedup in token latency. This acceleration is achieved without loss in accuracy.
  • Figure 2: Format for parallelized reasoning trajectories. <Parallel> encloses a fork-join block with <Outlines> and multiple <Thread> sections; content inside each <Thread> is intended to be generated concurrently by the runtime, while all other spans are decoded sequentially. This toy snippet is for illustration only, since actual trajectories can exceed 10k tokens.
  • Figure 3: Inference-time request sequence for a parallel reasoning trajectory. For a given user prompt, Timestep 1 decodes the prefix and the <Outlines> block sequentially up to </Outlines>. Timestep 2 launches one completion request per outline in parallel, each seeded with its corresponding <Thread> $i$: prefix and stopped at </Thread>. Timestep 3 resumes sequential decoding over the joined context until [EOS] or the next </Outlines>.
  • Figure 4: Our training sequence formatting consists of three steps: 1) extract context and completion segments that the inference state machine will encounter in order to produce this trajectory; 2) insert these segments into a token-level trie (prefix-tree); 3) traverse the trie to produce a flat sequence with an ancestor-only attention mask and standard incremental positions counting from the root node. $*$ denotes text spans where loss is applied. Although we only show one branching in this example, ThreadWeaver supports branching and joining multiple times in a trajectory.
  • Figure 5: Per-problem speedup distributions on math benchmarks. The speedup is the ratio between the token latency of the sequential reasoning baseline and the token latency of ThreadWeaver. A vertical reference line at $1.0\times$ marks parity with the sequential baseline. ThreadWeaver achieves a significant speedup on most problems.
  • ...and 2 more figures
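
The three steps in the Figure 4 caption (collect segments, insert them into a token-level trie, flatten with an ancestor-only attention mask) can be illustrated with a small sketch. This is not the authors' implementation: the names `TrieNode`, `build_trie`, and `flatten` are hypothetical, tokens are plain strings rather than tokenizer ids, the toy segments are an assumed stand-in for what the inference state machine would see, and the loss-span marking ($*$) is omitted.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(eq=False)
class TrieNode:
    """One token in the prefix tree (hypothetical helper, not from the paper)."""
    token: str
    depth: int                              # position id, counted from the root
    parent: Optional["TrieNode"] = None
    children: dict = field(default_factory=dict)


def build_trie(sequences):
    """Insert each token sequence into a trie so shared prefixes are stored once."""
    root = TrieNode(token="<root>", depth=-1)
    for seq in sequences:
        node = root
        for tok in seq:
            if tok not in node.children:
                node.children[tok] = TrieNode(tok, node.depth + 1, node)
            node = node.children[tok]
    return root


def flatten(root):
    """Depth-first traversal that returns (tokens, positions, mask).

    mask[i][j] is True only when token j is token i itself or one of its
    ancestors in the trie, so sibling branches cannot attend to each other.
    """
    tokens, positions, parents = [], [], []
    # The stack holds (node, flat index of its parent); -1 means "no parent".
    stack = [(c, -1) for c in reversed(list(root.children.values()))]
    while stack:
        node, parent_idx = stack.pop()
        idx = len(tokens)
        tokens.append(node.token)
        positions.append(node.depth)
        parents.append(parent_idx)
        for child in reversed(list(node.children.values())):
            stack.append((child, idx))

    n = len(tokens)
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        j = i
        while j != -1:                      # walk up the chain of ancestors
            mask[i][j] = True
            j = parents[j]
    return tokens, positions, mask


# Assumed toy segments: two thread requests that share a prefix, plus the
# joined context that continues after both threads.
segments = [
    ["Q", "O", "A1", "A2"],
    ["Q", "O", "B1"],
    ["Q", "O", "A1", "A2", "B1", "J"],
]
tokens, positions, mask = flatten(build_trie(segments))
print(tokens)
print(positions)
```

In this toy case the shared prefix `Q O` is stored once, and the second thread's `B1` restarts at position 2 while attending only to `Q`, `O`, and itself. That mirrors the view a standalone completion request would have at inference time, which is the property the caption attributes to the ancestor-only mask and root-relative positions.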