ThreadWeaver: Adaptive Threading for Efficient Parallel Reasoning in Language Models

Long Lian; Sida Wang; Felix Juefei-Xu; Tsu-Jui Fu; Xiuyu Li; Adam Yala; Trevor Darrell; Alane Suhr; Yuandong Tian; Xi Victoria Lin

TL;DR

ThreadWeaver addresses the latency bottleneck of large language models during complex reasoning by introducing an adaptive parallel reasoning framework that runs on standard autoregressive inference engines. It combines a two-stage data generation pipeline, a trie-based training/inference co-design, and a parallelization-aware RL objective (P-GRPO) to learn when to spawn parallel threads and how to balance accuracy with speed. Empirically, it matches or slightly improves on sequential baselines across six math benchmarks while achieving up to a $1.53\times$ average speedup in token latency, establishing a Pareto frontier between accuracy and efficiency. The approach is designed to be deployment-friendly: it requires no changes to the underlying inference engine, which makes parallel reasoning practical to deploy at scale.

Abstract

Scaling inference-time computation has enabled Large Language Models (LLMs) to achieve strong reasoning performance, but inherently sequential decoding leads to substantial latency, especially on complex tasks. Recent work on adaptive parallel reasoning aims to improve inference efficiency by decomposing the problem-solving process into concurrent reasoning threads when beneficial. However, existing methods on realistic tasks are either limited to supervised behavior cloning or exhibit significant accuracy drops compared to widely used sequential long chain-of-thought (CoT) baselines. Moreover, many require customized inference engines, complicating deployment. We introduce ThreadWeaver, a framework for adaptive parallel reasoning that achieves accuracy on par with popular sequential reasoning models of comparable size while significantly reducing inference latency. ThreadWeaver's performance stems from three key innovations: 1) a two-stage parallel trajectory generator that produces large-scale, high-quality CoT data with parallel annotations for supervised fine-tuning; 2) a trie-based training-inference co-design that enables parallel reasoning on any off-the-shelf autoregressive inference engine without modifying position embeddings or KV caches; and 3) a parallelization-aware reinforcement learning framework that teaches the model to balance accuracy with effective parallelization. Across six challenging mathematical reasoning benchmarks, ThreadWeaver trained atop Qwen3-8B achieves accuracy comparable to cutting-edge sequential reasoning models (71.9% on average and 79.9% on AIME24) while delivering up to 1.53x average speedup in token latency, establishing a new Pareto frontier between accuracy and efficiency.
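
To make the notion of a token-latency speedup concrete, the following is a minimal illustrative sketch, not the paper's evaluation code. It assumes that token latency for a fork-join trajectory can be approximated by the number of tokens on its longest (critical) path, and that the sequential baseline would spend the same total number of tokens. The nested-list trajectory encoding and the function names `total_tokens`, `critical_path_tokens`, and `token_latency_speedup` are hypothetical and chosen only for this example.

```python
# Hypothetical encoding (an assumption for illustration only):
#   - an int is a sequential segment with that many tokens
#   - a list of threads is a parallel block; each thread is itself a trajectory
# Example: [100, [[300], [200], [250]], 50]
#   = 100 sequential tokens, then three parallel threads, then 50 tokens.


def total_tokens(trajectory):
    """Count every token generated, across all threads."""
    total = 0
    for segment in trajectory:
        if isinstance(segment, int):
            total += segment
        else:
            total += sum(total_tokens(thread) for thread in segment)
    return total


def critical_path_tokens(trajectory):
    """Count tokens on the longest path: parallel threads overlap in time,
    so a parallel block costs only as much as its longest thread."""
    longest = 0
    for segment in trajectory:
        if isinstance(segment, int):
            longest += segment
        else:
            longest += max(critical_path_tokens(thread) for thread in segment)
    return longest


def token_latency_speedup(trajectory):
    """Ratio of total tokens to critical-path tokens (assumed proxy for speedup)."""
    return total_tokens(trajectory) / critical_path_tokens(trajectory)


example = [100, [[300], [200], [250]], 50]
print(total_tokens(example))           # 900
print(critical_path_tokens(example))   # 450
print(token_latency_speedup(example))  # 2.0
```

Under these assumptions, a trajectory gains nothing from parallel blocks whose threads are very unbalanced, since the longest thread dominates the critical path. This is one intuition for why a training objective would need to reward effective parallelization rather than parallelization alone.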
