Emergent Hierarchical Reasoning in LLMs through Reinforcement Learning

Haozhe Wang, Qixin Xu, Che Liu, Junhong Wu, Fangzhen Lin, Wenhu Chen

TL;DR

The paper shows that RL training brings out an emergent two-phase hierarchical reasoning dynamic in LLMs: early training consolidates low-level procedural skills, and later training shifts to mastering high-level strategic planning. It introduces Strategic Grams as a functional proxy for planning tokens, and HICRA, a hierarchy-aware credit assignment algorithm that amplifies learning on planning tokens. Empirical results across multiple models and benchmarks show substantial gains over baselines, supported by analyses that use semantic entropy to track strategic exploration. The work argues for shifting optimization toward high-impact planning tokens to accelerate robust, long-horizon reasoning.

Abstract

Reinforcement Learning (RL) has proven highly effective at enhancing the complex reasoning abilities of Large Language Models (LLMs), yet the underlying mechanisms driving this success remain largely opaque. Our analysis reveals that puzzling phenomena like "aha moments", "length-scaling" and entropy dynamics are not disparate occurrences but hallmarks of an emergent reasoning hierarchy, akin to the separation of high-level strategic planning from low-level procedural execution in human cognition. We uncover a compelling two-phase dynamic: initially, a model is constrained by procedural correctness and must improve its low-level skills. The learning bottleneck then decisively shifts, with performance gains being driven by the exploration and mastery of high-level strategic planning. This insight exposes a core inefficiency in prevailing RL algorithms like GRPO, which apply optimization pressure agnostically and dilute the learning signal across all tokens. To address this, we propose Hierarchy-Aware Credit Assignment (HICRA), an algorithm that concentrates optimization efforts on high-impact planning tokens. Our extensive experiments validate that HICRA significantly outperforms strong baselines, and offer deep insights into how reasoning advances through the lens of strategic exploration.
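
The idea of concentrating optimization pressure on planning tokens can be illustrated with a small sketch. This is not the paper's implementation: the function name `amplify_planning_advantages`, the `planning_mask` input, the `alpha` parameter, and the specific rule `a + alpha * |a|` are all assumptions chosen for illustration. The sketch takes per-token advantages (as a GRPO-style method would produce) and boosts only those at positions flagged as planning tokens, leaving execution tokens unchanged.

```python
from typing import List


def amplify_planning_advantages(
    advantages: List[float],
    planning_mask: List[bool],
    alpha: float = 0.2,
) -> List[float]:
    """Boost the advantage on planning tokens; leave other tokens as-is.

    Hypothetical rule (an assumption, not taken from the source):
        planning token:  a + alpha * |a|
        other token:     a
    """
    if len(advantages) != len(planning_mask):
        raise ValueError("advantages and planning_mask must have equal length")
    return [
        a + alpha * abs(a) if is_planning else a
        for a, is_planning in zip(advantages, planning_mask)
    ]


# Example: the first two positions are flagged as planning tokens.
shaped = amplify_planning_advantages(
    [1.0, -1.0, 0.5], [True, True, False], alpha=0.5
)
```

Under this assumed rule, a positive advantage on a planning token grows and a negative one shrinks in magnitude, so planning tokens are reinforced more and penalized less than execution tokens in the same response.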

Paper Structure

This paper contains 25 sections, 10 equations, 14 figures, 2 tables.

Figures (14)

  • Figure 1: (Left) LLM reasoning mirrors human-like hierarchical reasoning: high-level strategic planning and low-level procedural execution. (Right) Hierarchical reasoning emerges during RL training via a two-phase dynamic. Phase ① consolidates low-level skills, marked by a token-entropy drop in execution tokens. The learning frontier then shifts to Phase ②, where the model explores and masters high-level planning, marked by increased semantic diversity, sustained reasoning improvement, and length scaling.
  • Figure 2: Reasoning from Qwen3-4B-GRPO with planning tokens (strategic grams) highlighted. Planning tokens function as the high-level strategic moves of reasoning, including logical deduction, branching, and backtracking.
  • Figure 3: We track the training dynamics of representative model families. The curves reveal a two-phase dynamic. As the first two columns show, the model initially focuses on procedural consolidation, marked by a sharp decrease in model perplexity (greater confidence) and in the token entropy of execution tokens (greater certainty). This is followed by a shift to exploring strategic planning, evident in the third column: the diversity of strategic plans (semantic entropy) steadily increases on Qwen models, or turns upward on Llama, correlating with consistently improved accuracy and longer reasoning chains (fourth column).
  • Figure 4: Comparison of Token Entropy and Semantic Entropy. (Left) Token-level entropy is computed over the next-token probability distribution. (Right) Semantic entropy is computed as the Shannon entropy over the frequency distribution of n-grams. Intuitively, semantic entropy groups tokens by their semantic function and measures semantic diversity. Token-level entropy is not de-duplicated by semantic meaning, and is thus dominated by the vast number of high-frequency low-level tokens.
  • Figure 5: Training Dynamics of Error Types. Across all models, the number of Planning & Strategy errors (red) decreases more markedly than the number of procedural errors (gray), indicating that RL's primary benefit comes from correcting high-level strategic faults.
  • ...and 9 more figures
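
The distinction drawn in the Figure 4 caption can be made concrete with a minimal sketch. It assumes token entropy is the Shannon entropy of a single next-token distribution and semantic entropy is the Shannon entropy of the empirical frequency distribution of n-grams, as the caption describes. The function names and the toy inputs are hypothetical, and how the paper extracts or groups strategic n-grams is not specified here.

```python
import math
from collections import Counter
from typing import Hashable, Iterable, Sequence


def shannon_entropy(probs: Iterable[float]) -> float:
    """Shannon entropy in nats; zero-probability terms are skipped."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def token_entropy(next_token_probs: Sequence[float]) -> float:
    """Entropy of one next-token probability distribution."""
    return shannon_entropy(next_token_probs)


def semantic_entropy(ngrams: Sequence[Hashable]) -> float:
    """Entropy of the frequency distribution over observed n-grams."""
    counts = Counter(ngrams)
    total = sum(counts.values())
    return shannon_entropy(c / total for c in counts.values())


# Toy data (hypothetical): two distinct strategic n-grams, equally frequent.
diverse = semantic_entropy(
    ["let's try", "alternatively", "let's try", "alternatively"]
)
# One n-gram repeated: no strategic diversity.
collapsed = semantic_entropy(["let's try"] * 4)
# A uniform next-token distribution over two tokens.
uniform = token_entropy([0.5, 0.5])
```

Because `semantic_entropy` counts each distinct n-gram once in the support and weights it by frequency, repeating the same strategic phrase many times does not raise it, whereas introducing new phrases does.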