Fairness Begins with State: Purifying Latent Preferences for Hierarchical Reinforcement Learning in Interactive Recommendation

Yun Lu, Xiaoyu Shi, Hong Xie, Xiangyu Zhao, Mingsheng Shang

TL;DR

This work proposes a Denoising State Representation Module (DSRM), based on diffusion models, that recovers the low-entropy latent preference manifold from high-entropy, noisy interaction histories. It pairs this module with a Hierarchical Reinforcement Learning (HRL) agent that decouples two conflicting objectives: long-term fairness and short-term engagement.

Abstract

Interactive recommender systems (IRS) are increasingly optimized with Reinforcement Learning (RL) to capture the sequential nature of user-system dynamics. However, existing fairness-aware methods often suffer from a fundamental oversight: they assume the observed user state is a faithful representation of true preferences. In reality, implicit feedback is contaminated by popularity-driven noise and exposure bias, creating a distorted state that misleads the RL agent. We argue that the persistent conflict between accuracy and fairness is not merely a reward-shaping issue, but a state estimation failure. In this work, we propose DSRM-HRL, a framework that reformulates fairness-aware recommendation as a latent state purification problem followed by decoupled hierarchical decision-making. We introduce a Denoising State Representation Module (DSRM) based on diffusion models to recover the low-entropy latent preference manifold from high-entropy, noisy interaction histories. Built upon this purified state, a Hierarchical Reinforcement Learning (HRL) agent is employed to decouple conflicting objectives: a high-level policy regulates long-term fairness trajectories, while a low-level policy optimizes short-term engagement under these dynamic constraints. Extensive experiments on high-fidelity simulators (KuaiRec, KuaiRand) demonstrate that DSRM-HRL effectively breaks the "rich-get-richer" feedback loop, achieving a superior Pareto frontier between recommendation utility and exposure equity.
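
The two-stage idea in the abstract (purify the state, then decide hierarchically) can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not specify the DSRM architecture or the HRL policies. The sketch assumes a standard DDPM-style forward noising step, and every function name and both decision rules (`high_level_fairness_weight`, `low_level_select`) are hypothetical stand-ins chosen only to show how a high-level fairness signal might constrain a low-level engagement choice.

```python
import math
import random


def forward_noise(x0, alpha_bar, rng):
    """DDPM-style forward step: x_t = sqrt(a) * x0 + sqrt(1 - a) * eps."""
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = [math.sqrt(alpha_bar) * v + math.sqrt(1.0 - alpha_bar) * e
          for v, e in zip(x0, eps)]
    return xt, eps


def recover_x0(xt, eps_hat, alpha_bar):
    """Invert the forward step given a noise estimate eps_hat.

    In a diffusion model a trained denoiser would supply eps_hat;
    here the caller passes it in (assumption for illustration).
    """
    return [(v - math.sqrt(1.0 - alpha_bar) * e) / math.sqrt(alpha_bar)
            for v, e in zip(xt, eps_hat)]


def high_level_fairness_weight(exposure_counts):
    """Hypothetical high-level rule: weight grows with exposure concentration."""
    total = sum(exposure_counts)
    if total == 0:
        return 0.0
    # Exposure share of the most-exposed item, always in [0, 1].
    return max(exposure_counts) / total


def low_level_select(relevance, exposure_counts, fairness_weight):
    """Hypothetical low-level rule: relevance minus a weighted exposure penalty."""
    total = sum(exposure_counts) or 1
    scores = [r - fairness_weight * (c / total)
              for r, c in zip(relevance, exposure_counts)]
    return max(range(len(scores)), key=scores.__getitem__)


if __name__ == "__main__":
    rng = random.Random(0)
    clean_state = [0.5, -1.0, 2.0]
    noisy_state, eps = forward_noise(clean_state, alpha_bar=0.6, rng=rng)
    purified = recover_x0(noisy_state, eps, alpha_bar=0.6)

    exposure = [8, 1, 1]          # item 0 dominates past exposure
    relevance = [0.9, 0.8, 0.1]   # toy relevance scores from the purified state
    weight = high_level_fairness_weight(exposure)
    print(low_level_select(relevance, exposure, weight))
```

In this toy setup the low-level rule picks the most relevant item when the fairness weight is zero, and shifts to a nearly as relevant but far less exposed item once the high-level weight reflects the concentrated exposure. A learned system would replace both rules with trained policies and the oracle noise with a denoiser's estimate.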

Paper Structure (27 sections, 11 equations, 7 figures, 4 tables)

Figures (7)

  • Figure 1: The motivation of DSRM-HRL. (Left) In interactive environments, the observed user state is heavily contaminated by popularity bias (red noise), obscuring true user preferences. (Middle) Existing RL methods typically fail under such noise: general RL agents succumb to the "rich-get-richer" loop (high AD), while naive fairness-aware agents sacrifice accuracy for diversity. (Right) Our proposed DSRM-HRL first purifies the state via diffusion-based denoising and then employs a hierarchical policy to dynamically balance long-term fairness and short-term utility.
  • Figure 2: The Spurious Reward Trap. Observed rewards are heavily dominated by exposure frequency rather than intrinsic relevance, creating a biased input for policy learning.
  • Figure 3: State Purification Gain. Denoising the input alone expands the Pareto frontier of accuracy and fairness without complex reward shaping.
  • Figure 4: Visualization of State Purification. DSRM transforms the popularity-dominated collapsed manifold into a disentangled, semantic preference space.
  • Figure 5: Overall framework.
  • ...and 2 more figures