Pangu Embedded: An Efficient Dual-system LLM Reasoner with Metacognition

Hanting Chen, Yasheng Wang, Kai Han, Dong Li, Lin Li, Zhenni Bi, Jinpeng Li, Haoyu Wang, Fei Mi, Mingjian Zhu, Bin Wang, Kaikai Song, Yifei Fu, Xu He, Yu Luo, Chong Zhu, Quan He, Xueyu Wu, Wei He, Hailin Hu, Yehui Tang, Dacheng Tao, Xinghao Chen, Yunhe Wang

TL;DR

Pangu Embedded is an efficient 7B LLM reasoner designed for edge deployment on Ascend NPUs, combining a two-stage training pipeline with a dual-system fast/slow thinking framework. Stage 1 builds a robust base through model-aware iterative distillation, inter-iteration model merging, and large-scale RL guided by the Multi-source Adaptive Reward System (MARS), all supported by a latency-tolerant scheduling infrastructure. Stage 2 equips the model with System 1 (fast) and System 2 (slow) thinking, offering both manual switching via meta prompts and an adaptive mode selector trained on a curated fusion dataset, plus a repetition self-repair mechanism to improve output coherence. Across AIME 2024, GPQA, LiveCodeBench, and LawBench, the 7B Pangu Embedded often outperforms similarly sized rivals and demonstrates robust dual-mode reasoning with efficient token usage, highlighting a practical pathway to powerful yet deployable LLM reasoners.

Abstract

This work presents Pangu Embedded, an efficient Large Language Model (LLM) reasoner developed on Ascend Neural Processing Units (NPUs), featuring flexible fast and slow thinking capabilities. Pangu Embedded addresses the significant computational costs and inference latency challenges prevalent in existing reasoning-optimized LLMs. We propose a two-stage training framework for its construction. In Stage 1, the model is fine-tuned via an iterative distillation process, incorporating inter-iteration model merging to effectively aggregate complementary knowledge. This is followed by reinforcement learning on Ascend clusters, optimized by a latency-tolerant scheduler that combines stale synchronous parallelism with prioritized data queues. The RL process is guided by a Multi-source Adaptive Reward System (MARS), which generates dynamic, task-specific reward signals using deterministic metrics and lightweight LLM evaluators for mathematics, coding, and general problem-solving tasks. Stage 2 introduces a dual-system framework, endowing Pangu Embedded with a "fast" mode for routine queries and a deeper "slow" mode for complex inference. This framework offers both manual mode switching for user control and an automatic, complexity-aware mode selection mechanism that dynamically allocates computational resources to balance latency and reasoning depth. Experimental results on benchmarks including AIME 2024, GPQA, and LiveCodeBench demonstrate that Pangu Embedded, with 7B parameters, outperforms similarly sized models such as Qwen3-8B and GLM4-9B. It delivers rapid responses and state-of-the-art reasoning quality within a single, unified model architecture, highlighting a promising direction for developing powerful yet practically deployable LLM reasoners.
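To make the idea of task-specific reward signals more concrete, the following is a minimal sketch of how a reward router in the spirit of MARS might dispatch between deterministic metrics and an LLM evaluator. It is an illustration under stated assumptions, not the authors' implementation: the `Sample` fields, the exact-match rule for mathematics, the test pass rate for coding, and the `llm_judge` callable are all hypothetical, since the abstract does not specify these details.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Sample:
    """One model response to score (all field names are hypothetical)."""
    task: str              # "math", "code", or anything else for general tasks
    response: str
    reference: str = ""    # ground-truth final answer, used for math
    tests_passed: int = 0  # unit tests passed, used for code
    tests_total: int = 0


def math_reward(sample: Sample) -> float:
    """Deterministic metric (assumed): exact match on the last response line."""
    lines = sample.response.strip().splitlines()
    final_answer = lines[-1].strip() if lines else ""
    return 1.0 if final_answer == sample.reference.strip() else 0.0


def code_reward(sample: Sample) -> float:
    """Deterministic metric (assumed): fraction of unit tests passed."""
    if sample.tests_total == 0:
        return 0.0
    return sample.tests_passed / sample.tests_total


def make_reward_router(llm_judge: Callable[[str], float]) -> Callable[[Sample], float]:
    """Build a reward function that picks a scorer by task type.

    `llm_judge` stands in for a lightweight LLM evaluator and is assumed
    to return a score that is clamped to [0, 1] here.
    """
    def reward(sample: Sample) -> float:
        if sample.task == "math":
            return math_reward(sample)
        if sample.task == "code":
            return code_reward(sample)
        return min(1.0, max(0.0, llm_judge(sample.response)))

    return reward
```

A router built with `make_reward_router(lambda text: 0.7)` would score a math sample by exact match, a coding sample by its test pass rate, and any other sample with the stubbed evaluator score. Separating the scorers from the dispatch keeps each reward source independently replaceable.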

Paper Structure

This paper contains 69 sections, 11 equations, 17 figures, 7 tables, and 1 algorithm.

Figures (17)

  • Figure 1: An illustration of the Pangu Embedded training pipeline. The pipeline consists of two primary stages: Stage 1: basic reasoner construction and Stage 2: enabling fast and slow thinking in one model.
  • Figure 2: An illustration of the construction of the initial data pool.
  • Figure 3: The overall framework of the model-aware iterative distillation pipeline in Pangu Embedded. In each iteration, data samples are selectively filtered based on a model-aware complexity metric, which is evaluated using the student model from the previous iteration. This metric matches data to the student model’s current capabilities. The student model is progressively refined through multiple rounds of distillation, guided by dynamic data selection and iterative model merging.
  • Figure 4: An illustration of the Multi-source Adaptive Reward System (MARS).
  • Figure 5: An illustration of the curriculum data mixing strategy for RL training. Data complexity is assessed in a model-aware manner, and a curated mix is fed to the RL agent.
  • ...and 12 more figures
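
The model-aware filtering described in the Figure 3 caption can be sketched as follows. This is a hedged illustration only: the captions do not define the complexity metric, so the sketch assumes it is the failure rate of the previous-iteration student over several sampled answers, and the `low` and `high` thresholds, the `pass_flags` field, and both function names are hypothetical.

```python
from typing import Dict, List, Sequence


def complexity(pass_flags: Sequence[bool]) -> float:
    """Assumed metric: share of the student's sampled answers that were wrong.

    0.0 means the student always solves the item, 1.0 means it never does.
    """
    if not pass_flags:
        raise ValueError("need at least one sampled answer")
    return 1.0 - sum(pass_flags) / len(pass_flags)


def select_for_iteration(
    pool: List[Dict],
    low: float = 0.2,
    high: float = 0.8,
) -> List[Dict]:
    """Keep items whose complexity falls inside the band [low, high].

    Each item is expected to carry `pass_flags`, the correctness of answers
    sampled from the previous iteration's student model (hypothetical field).
    Items that are too easy or too hard for the current student are dropped.
    """
    return [
        item for item in pool
        if low <= complexity(item["pass_flags"]) <= high
    ]
```

Because the flags are recomputed with each new student, an item that is dropped as too hard in one iteration can re-enter the training mix once the student improves, which is one way to read the caption's statement that the metric matches data to the student model's current capabilities.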