Hierarchical Semantic RL: Tackling the Problem of Dynamic Action Space for RL-based Recommendations

Minmao Wang, Xingchen Liu, Shijie Yi, Likang Wu, Hongke Zhao, Fei Pan, Qingpeng Cai, Peng Jiang

TL;DR

This paper tackles the scalability challenge in RL-based recommendations posed by dynamic, large action spaces. It introduces a fixed Semantic Action Space (SAS) built from Semantic IDs (SIDs) and a Hierarchical Policy Network (HPN) that generates SID tokens in a coarse-to-fine manner, complemented by a Multi-Level Critic (MLC) for token-level value estimation and refined credit assignment. The approach decouples policy learning from catalog dynamics, improves training stability through hierarchical residual state modeling, and achieves strong long-horizon performance on public benchmarks and a large-scale production dataset, including an 18.421% CVR lift with only a 1.251% rise in cost in online deployment. These results demonstrate a scalable, practical framework for RL-based recommendation systems capable of handling industrial-scale item catalogs and evolving item pools.

Abstract

Recommender Systems (RS) are fundamental to modern online services. While most existing approaches optimize for short-term engagement, recent work has begun to explore reinforcement learning (RL) to model long-term user value. However, these efforts face significant challenges due to the vast, dynamic action spaces inherent in recommendation, which hinder stable policy learning. To resolve this bottleneck, we introduce Hierarchical Semantic RL (HSRL), which reframes RL-based recommendation over a fixed Semantic Action Space (SAS). HSRL encodes items as Semantic IDs (SIDs) for policy learning, and maps SIDs back to their original items via a fixed, invertible lookup during execution. To align decision-making with SID generation, the Hierarchical Policy Network (HPN) operates in a coarse-to-fine manner, employing hierarchical residual state modeling to refine each level's context from the previous level's residual, thereby stabilizing training and reducing representation-decision mismatch. In parallel, a Multi-Level Critic (MLC) provides token-level value estimates, enabling fine-grained credit assignment. Across public benchmarks and a large-scale production dataset from a leading Chinese short-video advertising platform, HSRL consistently surpasses state-of-the-art baselines. In a seven-day online A/B test, it delivers an 18.421% CVR lift with only a 1.251% increase in cost, supporting HSRL as a scalable paradigm for RL-based recommendation. Our code is released at https://github.com/MinmaoWang/HSRL.
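
To make the idea of a fixed semantic action space concrete, the following is a minimal sketch of how items might be encoded as SIDs and mapped back through an invertible lookup. It assumes SIDs come from residual quantization over item embeddings, which is suggested by the "SID residual update" mentioned in Proposition 1 but is not specified in this summary. The codebooks, embeddings, item names, and function names (`encode_sid`, `build_lookup`) are hypothetical and are not taken from the released code.

```python
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]
SID = Tuple[int, ...]


def nearest(codebook: List[Vector], x: Vector) -> int:
    """Return the index of the codeword closest to x (squared Euclidean)."""
    return min(
        range(len(codebook)),
        key=lambda k: sum((a - b) ** 2 for a, b in zip(codebook[k], x)),
    )


def encode_sid(embedding: Vector, codebooks: List[List[Vector]]) -> SID:
    """Coarse-to-fine residual quantization: emit one token per level."""
    residual = list(embedding)
    tokens = []
    for codebook in codebooks:
        k = nearest(codebook, residual)
        tokens.append(k)
        # Whatever this level did not explain is passed to the next level.
        residual = [r - c for r, c in zip(residual, codebook[k])]
    return tuple(tokens)


def build_lookup(
    items: Dict[str, Vector], codebooks: List[List[Vector]]
) -> Tuple[Dict[str, SID], Dict[SID, str]]:
    """Build item -> SID and SID -> item tables; fail on collisions so the
    mapping stays invertible (an assumption of this sketch)."""
    item_to_sid: Dict[str, SID] = {}
    sid_to_item: Dict[SID, str] = {}
    for item_id, emb in items.items():
        sid = encode_sid(emb, codebooks)
        if sid in sid_to_item:
            raise ValueError(f"SID collision on {sid}")
        item_to_sid[item_id] = sid
        sid_to_item[sid] = item_id
    return item_to_sid, sid_to_item


# Hypothetical toy setup: 2 levels, 2 codewords per level.
codebooks = [
    [(0.0, 0.0), (10.0, 10.0)],  # level 1: coarse region
    [(0.0, 1.0), (1.0, 0.0)],    # level 2: fine offset within the region
]
items = {
    "ad_a": (0.1, 1.2),
    "ad_b": (1.1, 0.0),
    "ad_c": (10.0, 11.0),
    "ad_d": (11.0, 10.1),
}

item_to_sid, sid_to_item = build_lookup(items, codebooks)

# The policy's per-level output size depends only on the codebooks,
# not on how many items are currently in the catalog.
logits_per_level = [len(cb) for cb in codebooks]
```

In this sketch, adding or retiring an item only changes rows in the lookup tables; the per-level token vocabulary that a policy would act over (`logits_per_level`) stays the same, which is the sense in which the action space is decoupled from catalog dynamics.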

Paper Structure

This paper contains 49 sections, 1 theorem, 31 equations, 6 figures, 3 tables.

Key Result

Proposition 1

The HPN update rule (Eq. eq:hpn-residual) implements a continuous, differentiable approximation of the SID residual update (Eq. eq:sid-residual), preserving the coarse-to-fine autoregressive dependency structure. Specifically:

Figures (6)

  • Figure 1: Action Space in Recommendation.
  • Figure 2: Overview of HSRL. A low-dimensional Semantic Action Space (SAS) maps items to fixed-length semantic IDs; a coarse-to-fine Hierarchical Policy Network (HPN) generates SID tokens autoregressively with residual context refinement, progressively narrowing the semantic subspace; a Multi-Level Critic (MLC) provides level-aware value estimation for structure-aware credit assignment; and a joint actor–critic optimization supports efficient SID-based serving.
  • Figure 3: Online Deployment.
  • Figure 4: Online Performance.
  • Figure 5: Sensitivity Analysis.
  • ...and 1 more figure

Theorems & Definitions (1)

  • Proposition 1: Structural Alignment