Safely Exploring Novel Actions in Recommender Systems via Deployment-Efficient Policy Learning

Haruka Kiyohara, Yusuke Narita, Yuta Saito, Kei Tateno, Takuma Udagawa

TL;DR

The paper tackles the safe exploration of novel actions in recommender systems using offline data. It introduces Safe Off-Policy Policy Gradient (Safe OPG), which enforces a high-probability safety constraint via High Confidence Off-Policy Evaluation; experiments show, however, that Safe OPG is conservative, which limits novelty. To address this, the authors propose DEPSUE, a deployment-efficient framework that relaxes safety regularization across multiple deployments by accumulating safety margins, enabling controlled exploration of novel actions. Across semi-synthetic and real-world datasets, Safe OPG achieves safety guarantees, while DEPSUE achieves safer, more extensive exploration of novel actions with lower deployment costs than online learning. Collectively, the work provides a practical path to balancing safety and novelty in deployment-bound policy learning for evolving action spaces.

Abstract

In many real recommender systems, novel items are added frequently over time. The importance of sufficiently presenting novel actions has been widely acknowledged for improving long-term user engagement. A recent line of work builds on Off-Policy Learning (OPL), which trains a policy from only logged data; however, the existing methods can be unsafe in the presence of novel actions. Our goal is to develop a framework to enforce exploration of novel actions with a guarantee for safety. To this end, we first develop Safe Off-Policy Policy Gradient (Safe OPG), which is a model-free safe OPL method based on high confidence off-policy evaluation. In our first experiment, we observe that Safe OPG almost always satisfies a safety requirement, even when existing methods violate it greatly. However, the result also reveals that Safe OPG tends to be too conservative, suggesting a difficult tradeoff between guaranteeing safety and exploring novel actions. To overcome this tradeoff, we also propose a novel framework called Deployment-Efficient Policy Learning for Safe User Exploration, which leverages a safety margin and gradually relaxes safety regularization over multiple (not many) deployments. Our framework thus enables exploration of novel actions while guaranteeing safe implementation of recommender systems.
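
To make the idea of a safety check based on high confidence off-policy evaluation concrete, the following is a minimal sketch, not the paper's method. It assumes rewards lie in `[0, r_max]`, uses a clipped inverse propensity score (IPS) estimate, and applies Hoeffding's inequality to obtain a lower bound that holds with probability at least `1 - delta`. The function names, the clipping constant `w_max`, and the choice of Hoeffding's inequality are all illustrative assumptions; the bound actually used by Safe OPG is not specified in this summary.

```python
import numpy as np


def hoeffding_lower_bound(pi_probs, pi0_probs, rewards,
                          delta=0.05, w_max=10.0, r_max=1.0):
    """Hypothetical helper: high-confidence lower bound on a policy's value.

    pi_probs  : probabilities the candidate policy assigns to the logged actions
    pi0_probs : probabilities the logging policy assigned to the same actions
    rewards   : observed rewards, assumed to lie in [0, r_max]
    """
    pi_probs = np.asarray(pi_probs, dtype=float)
    pi0_probs = np.asarray(pi0_probs, dtype=float)
    rewards = np.asarray(rewards, dtype=float)

    # Clipped importance weights keep each term in [0, w_max * r_max].
    # With non-negative rewards, clipping can only lower the expectation,
    # so a lower bound on the clipped mean is also a lower bound on the value.
    weights = np.minimum(pi_probs / pi0_probs, w_max)
    terms = weights * rewards

    n = terms.shape[0]
    value_range = w_max * r_max
    slack = value_range * np.sqrt(np.log(1.0 / delta) / (2.0 * n))
    return float(terms.mean() - slack)


def is_safe(lower_bound, threshold):
    """Accept a candidate policy only if its lower bound clears the threshold C."""
    return lower_bound >= threshold


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n = 10_000
    pi0 = np.full(n, 0.5)                    # logging policy propensities
    pi = np.full(n, 0.6)                     # candidate policy propensities
    r = rng.binomial(1, 0.4, size=n).astype(float)
    lb = hoeffding_lower_bound(pi, pi0, r)
    print("lower bound:", lb, "safe:", is_safe(lb, threshold=0.3))
```

The slack term grows with `w_max`, which illustrates why such a check tends to be conservative: a policy that shifts probability toward rarely logged actions needs larger weights, and its lower bound drops accordingly.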

Paper Structure

This paper contains 34 sections, 1 theorem, 20 equations, 6 figures, 3 tables, 1 algorithm.

Key Result

Theorem 1

(Existence of a local saddle point) Under the assumptions labeled assm:bound through assm:regularization, there exists a local saddle point $(\pi^{\star}, \lambda^{\star})$ that satisfies, given $\mathcal{D}_0^{\mathrm{(S1)}}$, $\forall \epsilon > 0, \pi \in \Pi, \lambda \in \mathbb{R}_{+}$,

Figures (6)

  • Figure 1: Comparing safety and novelty of Safe OPG, OPG (naive), and OPG (w/ CQL) under various logging policies; (a) relative policy value compared to that of the logging policy ($V(\pi) / V(\pi_0)$) with 95% confidence interval, (b) probability of a policy violating the safety constraint ($P(V(\pi) < C)$) estimated with 30 simulation runs, and (c) novelty averaged over 30 simulation runs with 95% confidence interval.
  • Figure 2: Evaluating safety and novelty of DEPSUE under various logging policies; (a) relative policy value compared to that of the logging policy ($V(\pi) / V(\pi_0)$) and (b) novelty, averaged over 30 simulation runs with 95% confidence interval.
  • Figure 3: Overview of Safe and Deployment-Efficient Policy Learning for User Exploration (DEPSUE), which gradually relaxes the safety regularization to sufficiently explore novel actions while guaranteeing safety.
  • Figure 4: Relative value of the policies of (Left) OPG (naive) and (Right) OPG (w/ CQL) estimated by various validation OPE estimators.
  • Figure 5: Comparing relative value and novelty of the $k$-th deployment policies of DEPSUE ($K=5$).
  • ...and 1 more figure
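
The three quantities reported in Figure 1 (relative policy value $V(\pi) / V(\pi_0)$, the violation probability $P(V(\pi) < C)$, and a 95% confidence interval across simulation runs) can be computed along the following lines. This is a hedged sketch: the function name is hypothetical, and the normal-approximation interval is an assumption, since the captions do not say how the intervals were constructed.

```python
import numpy as np


def summarize_runs(policy_values, logging_value, threshold):
    """Hypothetical helper: aggregate per-run policy values into summary metrics.

    policy_values : true value V(pi) of the learned policy in each simulation run
    logging_value : value V(pi_0) of the logging policy
    threshold     : safety threshold C
    """
    values = np.asarray(policy_values, dtype=float)
    relative = values / logging_value

    n = relative.shape[0]
    mean = relative.mean()
    # Normal-approximation 95% interval (assumed; needs at least two runs).
    half_width = 1.96 * relative.std(ddof=1) / np.sqrt(n)

    # Empirical estimate of P(V(pi) < C): fraction of runs below the threshold.
    violation_rate = float(np.mean(values < threshold))

    return {
        "relative_value_mean": float(mean),
        "relative_value_ci": (float(mean - half_width), float(mean + half_width)),
        "violation_rate": violation_rate,
    }


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    simulated = rng.normal(loc=1.05, scale=0.1, size=30)  # 30 synthetic runs
    print(summarize_runs(simulated, logging_value=1.0, threshold=0.95))
```

With only 30 runs, the violation rate is resolved in steps of 1/30, so small differences between methods on that metric should be read with care.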

Theorems & Definitions (2)

  • Theorem 1
  • proof