ARISE: Agent Reasoning with Intrinsic Skill Evolution in Hierarchical Reinforcement Learning

Yu Li, Rui Miao, Zhengling Qi, Tian Lan

Abstract

The dominant paradigm for improving mathematical reasoning in language models relies on reinforcement learning with verifiable rewards. Yet existing methods treat each problem instance in isolation, without leveraging the reusable strategies that emerge and accumulate during training. To this end, we introduce ARISE (Agent Reasoning via Intrinsic Skill Evolution), a hierarchical reinforcement learning framework in which a shared policy both manages skills at the high level and generates responses at the low level (acting as a Skills Manager and a Worker, respectively). The Manager maintains a tiered skill library through a dedicated skill generation rollout that performs structured summarization of successful solution traces (after execution), while employing a policy-driven selection mechanism to retrieve relevant skills to condition future rollouts (before execution). A hierarchical reward design guides the co-evolution of reasoning ability and library quality. Experiments on two base models and seven benchmarks spanning both competition mathematics and Omni-MATH show that ARISE consistently outperforms GRPO-family algorithms and memory-augmented baselines, with particularly notable gains on out-of-distribution tasks. Ablation studies confirm that each component contributes to the observed improvements and that library quality and reasoning performance improve in tandem throughout training. Code is available at https://github.com/Skylanding/ARISE.

Paper Structure

This paper contains 38 sections, 10 equations, 5 figures, 10 tables, 1 algorithm.

Figures (5)

  • Figure 1: Overview of ARISE on Qwen3-4B, showing performance across seven benchmarks, training reward dynamics on DeepScaleR, skill utilization over training, and accuracy gain versus token overhead relative to GRPO.
  • Figure 2: Overview of ARISE. The shared policy $\pi_\theta$ operates as both Skills Manager and Worker. Before each rollout, the Manager scores cache entries via conditional log-probability and injects the selected skill into the prompt (Download). After reward computation, an additional rollout $O_{G+1}$ distills successful solutions into a structured skill document (Upload). The two-tier library consists of a compact cache (active pool for selection) and a larger reservoir (archive for future promotion), maintained by five operations: Add, Update, Evict, Load, and Delete.
  • Figure 3: Ablation on Qwen3-4B with Pass@1 (%).
  • Figure 4: Example of a generated skill document following the uniform schema.
  • Figure 5: Prompt template used for the skill generation rollout $O_{G+1}$. Up to two successful traces are included, each truncated to 400 characters. The generation uses temperature 0.7, top-$p$ 0.95, and a maximum of 192 new tokens.
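
The limits stated in the Figure 5 caption can be made concrete with a small sketch of how the input to the skill generation rollout $O_{G+1}$ might be assembled. Only the numeric values (two traces, 400 characters, temperature 0.7, top-$p$ 0.95, 192 new tokens) come from the caption. The template wording, the function name `build_skill_prompt`, and the key names in `GENERATION_KWARGS` (chosen to mirror common Hugging Face `generate` arguments) are assumptions for illustration, not the authors' implementation.

```python
# Values taken from the Figure 5 caption.
MAX_TRACES = 2          # up to two successful traces are included
MAX_TRACE_CHARS = 400   # each trace is truncated to 400 characters
GENERATION_KWARGS = {   # key names are an assumption about the inference stack
    "temperature": 0.7,
    "top_p": 0.95,
    "max_new_tokens": 192,
}

# Hypothetical template wording; the paper's actual prompt text is not reproduced here.
TEMPLATE = (
    "Problem:\n{problem}\n\n"
    "Successful solutions:\n{traces}\n\n"
    "Summarize the reusable strategy as a skill document."
)


def build_skill_prompt(problem: str, successful_traces: list[str]) -> str:
    """Assemble the prompt for the skill generation rollout.

    Keeps at most MAX_TRACES traces and truncates each one to
    MAX_TRACE_CHARS characters before filling the template.
    """
    kept = [trace[:MAX_TRACE_CHARS] for trace in successful_traces[:MAX_TRACES]]
    traces = "\n\n".join(
        f"[Trace {i}]\n{trace}" for i, trace in enumerate(kept, start=1)
    )
    return TEMPLATE.format(problem=problem, traces=traces)
```

A caller would pass the resulting prompt to the shared policy together with `GENERATION_KWARGS`. Since the rollout distills successful solutions, a real pipeline would presumably skip this step when no trace in the group earned a reward; that check is left out of the sketch.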