Native Parallel Reasoner: Reasoning in Parallelism via Self-Distilled Reinforcement Learning

Tong Wu, Yang Liu, Jun Bai, Zixia Jia, Shuyi Zhang, Ziyong Lin, Yanting Wang, Song-Chun Zhu, Zilong Zheng

TL;DR

Native Parallel Reasoner (NPR) enables LLMs to develop genuine native parallel reasoning without external supervision. It combines a three-stage self-distilled training pipeline (Format-follow RL, rejection sampling with parallel warmup, and Native-parallel RL) with a Parallel-Aware Policy Optimization (PAPO) algorithm and a robust NPR Engine for stable parallel rollout. On eight reasoning benchmarks, NPR trained on Qwen3-4B yields performance gains of up to 24.5% and inference speedups of up to 4.6x, with 100% genuine parallel execution, whereas prior baselines often fall back to autoregressive decoding. The results indicate that self-evolved, distributed reasoning primitives can outperform teacher-guided or hand-crafted parallelism, offering scalable, efficient agentic reasoning. The NPR Engine also provides a practical backend for production-grade native parallel RL training.

Abstract

We introduce Native Parallel Reasoner (NPR), a teacher-free framework that enables Large Language Models (LLMs) to self-evolve genuine parallel reasoning capabilities. NPR transforms the model from sequential emulation to native parallel cognition through three key innovations: 1) a self-distilled progressive training paradigm that transitions from "cold-start" format discovery to strict topological constraints without external supervision; 2) a novel Parallel-Aware Policy Optimization (PAPO) algorithm that optimizes branching policies directly within the execution graph, allowing the model to learn adaptive decomposition via trial and error; and 3) a robust NPR Engine that refactors memory management and flow control of SGLang to enable stable, large-scale parallel RL training. Across eight reasoning benchmarks, NPR trained on Qwen3-4B achieves performance gains of up to 24.5% and inference speedups of up to 4.6x. Unlike prior baselines that often fall back to autoregressive decoding, NPR demonstrates 100% genuine parallel execution, establishing a new standard for self-evolving, efficient, and scalable agentic reasoning.

Paper Structure

This paper contains 44 sections, 9 equations, 5 figures, 7 tables, 2 algorithms.

Figures (5)

  • Figure 1: Native Parallel Reasoner (NPR) transforms a base model from sequential chain-of-thought (CoT) to native parallel reasoning via a self-distilled progressive training paradigm. Compared with previous SoTA, NPR achieves high reasoning accuracy, genuine parallelism and token acceleration. The illustrated results are evaluated on the AIME25 benchmark.
  • Figure 2: An overview of the NPR training framework.
  • Figure 3: Comparison of GRPO-style RL and Parallel-Aware Policy Optimization (PAPO).
  • Figure 4: Learning dynamics of evaluation on AIME 2025.
  • Figure : Parallel Attention Mask