Policy-Guided Causal State Representation for Offline Reinforcement Learning Recommendation

Siyu Wang, Xiaocong Chen, Lina Yao

TL;DR

PGCR targets the high-dimensional, noisy, and causally irrelevant state information that hampers offline RL-based recommender systems by explicitly separating causally relevant components (CRCs) from causally irrelevant components (CIRCs) in the state. It proposes a two-stage framework: first, a causal feature selection policy that intervenes on actions and is rewarded, via the Wasserstein distance $W_1$, for preserving CRCs; second, an encoder trained with an MSE objective to map states to CRC-focused latent representations. The authors prove identifiability of causal effects under interventions and show empirically that CRC-focused representations improve both offline policy performance and online CTR in simulation. This approach enables offline RLRS to learn more robust decision-making with limited data, offering practical benefits for scalable, privacy-preserving recommender systems.
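To make the $W_1$ term concrete, here is a minimal sketch of the Wasserstein-1 distance between two one-dimensional empirical samples. It assumes the quantity being compared is a batch of rewards collected from original states versus modified states; the variable names and the synthetic data are hypothetical, and how the distance is turned into the policy's reward is not specified here.

```python
import numpy as np

def wasserstein_1(x, y):
    """W1 between two equal-sized 1-D empirical samples.

    For equal sample sizes, the optimal coupling matches sorted values,
    so W1 is the mean absolute difference of the order statistics.
    """
    x = np.sort(np.asarray(x, dtype=float))
    y = np.sort(np.asarray(y, dtype=float))
    if x.shape != y.shape:
        raise ValueError("this sketch assumes equal sample sizes")
    return float(np.mean(np.abs(x - y)))

rng = np.random.default_rng(0)

# Hypothetical rewards observed from the original states.
rewards_original = rng.normal(loc=1.0, scale=0.5, size=1000)

# Hypothetical case A: the modification left reward-relevant parts intact,
# so the reward distribution is unchanged.
rewards_kept = rewards_original.copy()

# Hypothetical case B: the modification damaged reward-relevant parts,
# shifting every reward down by 0.8.
rewards_shifted = rewards_original - 0.8

w_kept = wasserstein_1(rewards_original, rewards_kept)
w_shifted = wasserstein_1(rewards_original, rewards_shifted)
print(w_kept, w_shifted)
```

A small distance indicates that the modification did not change the reward distribution, which is the behaviour a CRC-preserving policy would be expected to show; a pure shift of the rewards produces a distance equal to the size of the shift.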

Abstract

In offline reinforcement learning-based recommender systems (RLRS), learning effective state representations is crucial for capturing user preferences that directly impact long-term rewards. However, raw state representations often contain high-dimensional, noisy information and components that are not causally relevant to the reward. Additionally, missing transitions in offline data make it challenging to accurately identify features that are most relevant to user satisfaction. To address these challenges, we propose Policy-Guided Causal Representation (PGCR), a novel two-stage framework for causal feature selection and state representation learning in offline RLRS. In the first stage, we learn a causal feature selection policy that generates modified states by isolating and retaining only the causally relevant components (CRCs) while altering irrelevant components. This policy is guided by a reward function based on the Wasserstein distance, which measures the causal effect of state components on the reward and encourages the preservation of CRCs that directly influence user interests. In the second stage, we train an encoder to learn compact state representations by minimizing the mean squared error (MSE) loss between the latent representations of the original and modified states, ensuring that the representations focus on CRCs. We provide a theoretical analysis proving the identifiability of causal effects from interventions, validating the ability of PGCR to isolate critical state components for decision-making. Extensive experiments demonstrate that PGCR significantly improves recommendation performance, confirming its effectiveness for offline RL-based recommender systems.
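The second stage can be illustrated with a toy sketch. It assumes a linear encoder, a known boolean mask marking the CRCs, and random noise standing in for the learned policy's modifications; all three are simplifications introduced for illustration and are not the paper's architecture or training setup.

```python
import numpy as np

rng = np.random.default_rng(0)
n, state_dim, latent_dim = 256, 6, 3

# Hypothetical CRC mask: the first three state components are causally relevant.
relevant = np.array([True, True, True, False, False, False])

states = rng.normal(size=(n, state_dim))
modified = states.copy()
# Stand-in for the causal policy: alter only the irrelevant components.
modified[:, ~relevant] = rng.normal(size=(n, int((~relevant).sum())))

# Linear encoder z = s @ W (an assumption; a real encoder would be a network).
W = rng.normal(scale=0.5, size=(state_dim, latent_dim))
W_init = W.copy()

def mse_loss(weights):
    """MSE between latent codes of original and modified states."""
    diff = states @ weights - modified @ weights
    return float(np.mean(diff ** 2))

loss_before = mse_loss(W)

lr = 0.1
delta = states - modified  # exactly zero on the relevant columns
for _ in range(200):
    diff = delta @ W                         # shape (n, latent_dim)
    grad = 2.0 * delta.T @ diff / diff.size  # gradient of the MSE w.r.t. W
    W -= lr * grad

loss_after = mse_loss(W)
print(loss_before, loss_after)
print(np.abs(W[relevant]).max(), np.abs(W[~relevant]).max())
```

In this linear toy case the gradient is zero on the rows of `W` that read the relevant components, so only the weights on the altered components shrink. The encoder therefore becomes insensitive to the irrelevant components while its view of the relevant ones is left unchanged.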

Paper Structure

This paper contains 24 sections, 2 theorems, 14 equations, 4 figures, 2 tables, 2 algorithms.

Key Result

Proposition 1

Suppose the state $s_t$ and action $a_t$ are observable and form an MDP, as described in EQ:SCM_MDP. The variable $s_t$ satisfies the back-door criterion (see Definition A.3) relative to the pair of variables $(a_t, s_{t+1})$ because it meets the following criteria: there is no descendant of $a_t$ in $s_t$
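For reference, the back-door criterion in its textbook form (Pearl, 2009) and the adjustment formula it licenses are given below, instantiated with $Z = s_t$, $X = a_t$, and $Y = s_{t+1}$. This is the standard statement, offered as a reading aid; the paper's own wording and notation may differ.

```latex
% Back-door criterion: a set Z satisfies it relative to (X, Y) if
%   (i)  no node in Z is a descendant of X, and
%   (ii) Z blocks every path between X and Y that contains an arrow into X.
% With Z = s_t, X = a_t, Y = s_{t+1}, back-door adjustment gives
P\bigl(s_{t+1} \mid do(a_t)\bigr)
  = \sum_{s_t} P\bigl(s_{t+1} \mid a_t, s_t\bigr)\, P(s_t)
```

For continuous states the sum is replaced by an integral over $s_t$. The right-hand side contains only observational quantities, which is what makes the interventional distribution identifiable from logged data.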

Figures (4)

  • Figure 1: (a) A graphical representation of causal relationships among $s_t$, $a_t$, and $r_t$, with green lines indicating the causal agent's interventions. (b) An extended diagram that includes the latent state $z_t$ (blue lines), showing that $a_t$ depends on $z_{t-1}$ instead of $s_{t-1}$ (green, dashed lines), as described in Proposition 2. (c) The causal agent intervenes on actions to generate modified states $s_t^{\mathcal{I}}$, while the expert agent collects rewards from both original and modified states to train the causal policy. (d) The causal agent uses the offline dataset to generate modified states, which are processed by the encoder to learn latent representations for training the recommendation agent.
  • Figure 2: The 1-step CTR performance in the VirtualTaobao simulation is presented as the mean with error bars.
  • Figure 3: Performance comparisons in VirtualTB: (a) DDPG as the backbone, (b) SAC as the backbone, and (c) TD3 as the backbone. Ablation versions with random states are also included in each backbone.
  • Figure 4: Hyperparameter study in VirtualTB.

Theorems & Definitions (9)

  • Definition 2.1: Structural Causal Model
  • Definition 2.2: Intervention
  • Proposition 1: Identifiability
  • Proposition 2: Optimal Policy Based on Latent State Representation
  • Definition A.1: d-Separation (DOI: 10.5555/3202377)
  • Definition A.2: Valid adjustment set (Pearl, 2009)
  • Definition A.3: Back-Door Criterion (Pearl, 2009)
  • Definition A.4: Back-Door Adjustment (Pearl, 2009)
  • Definition A.5: Do-Calculus (Pearl, 2009)
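As a numeric illustration of back-door adjustment (Definition A.4), the sketch below computes $P(Y=1 \mid do(X=x)) = \sum_z P(Y=1 \mid X=x, Z=z)\,P(Z=z)$ on a toy model with a binary confounder and compares it with the plain conditional $P(Y=1 \mid X=x)$. The probability tables are invented for illustration; in the paper's setting $Z$, $X$, and $Y$ would correspond to $s_t$, $a_t$, and $s_{t+1}$.

```python
import numpy as np

# Hypothetical toy tables over binary Z (confounder), X (treatment), Y (outcome).
p_z = np.array([0.7, 0.3])              # P(Z=z)
p_x_given_z = np.array([[0.8, 0.2],     # P(X=x | Z=0) for x = 0, 1
                        [0.3, 0.7]])    # P(X=x | Z=1) for x = 0, 1
p_y1_given_zx = np.array([[0.1, 0.5],   # P(Y=1 | Z=0, X=x) for x = 0, 1
                          [0.4, 0.8]])  # P(Y=1 | Z=1, X=x) for x = 0, 1

def backdoor_adjust(x):
    """Interventional P(Y=1 | do(X=x)): average over the marginal P(Z)."""
    return float(np.sum(p_y1_given_zx[:, x] * p_z))

def observational(x):
    """Observational P(Y=1 | X=x): average over the posterior P(Z | X=x)."""
    joint_zx = p_z * p_x_given_z[:, x]        # P(Z=z, X=x)
    p_z_given_x = joint_zx / joint_zx.sum()   # P(Z=z | X=x)
    return float(np.sum(p_y1_given_zx[:, x] * p_z_given_x))

for x in (0, 1):
    print(x, backdoor_adjust(x), observational(x))
```

With these tables the adjusted value for $x=1$ is 0.59 while the observational value is 0.68. The gap arises because $Z$ influences both $X$ and $Y$; adjusting for $Z$ removes that confounding.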