Cascading Reinforcement Learning

Yihan Du; R. Srikant; Wei Chen

TL;DR

This work generalizes cascading bandits to cascading RL by incorporating user states and state transitions within a cascading MDP. It introduces a novel DP-based BestPerm oracle that enables efficient planning over a combinatorial action space and builds two algorithms: CascadingVI for regret minimization and CascadingBPI for best policy identification, both with near-optimal guarantees. The regret bound $\tilde{O}( H \sqrt{H S N K} )$ and the identification sample complexity $\tilde{O}( H^3 S N / \varepsilon^2 )$ scale polynomially in problem parameters and are independent of $|\mathcal{A}|$. Empirical results on MovieLens and synthetic data show substantially improved computation and sample efficiency over naive RL adaptations, highlighting the practicality of stateful, sequential recommendation in real-world settings.
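To give a sense of why independence from $|\mathcal{A}|$ matters, the following minimal sketch counts the ordered item lists available to the agent. It assumes that an action is any ordered list of between 1 and `max_len` distinct items drawn from `n_items` items; the function name and the length cap `max_len` are illustrative assumptions, not notation from the paper.

```python
from math import perm


def num_item_lists(n_items: int, max_len: int) -> int:
    """Count ordered lists of 1..max_len distinct items out of n_items.

    Assumption: an action is an ordered subset without repetition,
    so there are perm(n_items, k) lists of each length k.
    """
    return sum(perm(n_items, k) for k in range(1, max_len + 1))


if __name__ == "__main__":
    # The count grows quickly with the allowed list length.
    for max_len in (1, 2, 3, 4):
        print(max_len, num_item_lists(20, max_len))
```

Under this assumption, even a small pool of items yields a large number of candidate lists, which is the combinatorial growth that an algorithm enumerating actions directly would have to pay for.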

Abstract

Cascading bandits have gained popularity in recent years due to their applicability to recommendation systems and online advertising. In the cascading bandit model, at each timestep, an agent recommends an ordered subset of items (called an item list) from a pool of items, each associated with an unknown attraction probability. The user then examines the list and clicks the first attractive item (if any), after which the agent receives a reward. The goal of the agent is to maximize the expected cumulative reward. However, the prior literature on cascading bandits ignores the influence of user states (e.g., historical behaviors) on recommendations and the change of states as the session proceeds. Motivated by this fact, we propose a generalized cascading RL framework, which incorporates user states and state transitions into decisions. In cascading RL, we need to select items that not only have large attraction probabilities but also lead to good successor states. This imposes a huge computational challenge due to the combinatorial action space. To tackle this challenge, we delve into the properties of value functions and design an oracle BestPerm to efficiently find the optimal item list. Equipped with BestPerm, we develop two algorithms, CascadingVI and CascadingBPI, which are both computationally efficient and sample-efficient, and provide near-optimal regret and sample complexity guarantees. Furthermore, we present experiments to show the improved computational and sample efficiencies of our algorithms compared to straightforward adaptations of existing RL algorithms in practice.
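The cascading click model described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: items are assumed to attract the user independently, each item is assumed to carry its own reward that is received only when it is clicked, and the names `simulate_click`, `expected_reward`, `attraction`, and `reward` are hypothetical.

```python
import random


def simulate_click(item_list, attraction, rng):
    """Return the position of the first attractive item, or None.

    The user scans the list top to bottom and stops at the first
    item that attracts them (assumed independent across items).
    """
    for pos, item in enumerate(item_list):
        if rng.random() < attraction[item]:
            return pos
    return None


def expected_reward(item_list, attraction, reward):
    """Expected reward of an ordered list under the cascading model.

    Item i is clicked only if every earlier item failed to attract,
    so its contribution is weighted by the probability of reaching it.
    """
    total = 0.0
    p_reach = 1.0  # probability the user examines the current position
    for item in item_list:
        total += p_reach * attraction[item] * reward[item]
        p_reach *= 1.0 - attraction[item]
    return total


if __name__ == "__main__":
    attraction = {"a": 0.5, "b": 0.5}
    reward = {"a": 1.0, "b": 1.0}
    rng = random.Random(0)
    print(simulate_click(["a", "b"], attraction, rng))
    print(expected_reward(["a", "b"], attraction, reward))
```

In the stateful setting the abstract describes, the per-item reward in this sketch would be replaced by a quantity that also accounts for the value of the successor state, which is why the order of the list matters beyond attraction probabilities alone.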

Paper Structure (34 sections, 25 theorems, 102 equations, 3 figures, 3 algorithms)

Introduction
Related Work
Problem Formulation
An Efficient Oracle for Cascading RL
Crucial Properties of Problem \ref{eq:oracle_problem}
Oracle $\mathtt{BestPerm}$
Regret Minimization for Cascading RL
Algorithm $\mathtt{CascadingVI}$
Theoretical Guarantee of Algorithm $\mathtt{CascadingVI}$
Best Policy Identification for Cascading RL
Experiments
Conclusion
More Experiments
Experimental Setup with Real-world Data
Experiments on Synthetic Data
...and 19 more sections

Key Result

Lemma 1

The weighted cascading reward function $f(A,u,w)$ satisfies the following properties:

Figures (3)

Figure 1: Experiments for cascading RL on real-world data.
Figure 2: The constructed cascading MDP in synthetic data.
Figure 3: Experiments for cascading RL on synthetic data.

Theorems & Definitions (48)

Lemma 1
Lemma 2: Correctness of Oracle $\mathtt{BestPerm}$
Theorem 1: Regret Upper Bound
Theorem 2
Lemma 3: Interchange by Descending Weights
proof: Proof of Lemma \ref{lemma:exchange_by_weight}
Lemma 4: Items behind $a_{\bot}$ Do Not Matter
proof
proof: Proof of Lemma \ref{lemma:f_property}
proof: Proof of Lemma \ref{lemma:guarantee_oracle}
...and 38 more
