Cascading Reinforcement Learning
Yihan Du, R. Srikant, Wei Chen
TL;DR
This work generalizes cascading bandits to cascading RL by incorporating user states and state transitions within a cascading MDP. It introduces a DP-based oracle, BestPerm, that enables efficient planning over a combinatorial action space, and builds two algorithms on it: CascadingVI for regret minimization and CascadingBPI for best policy identification, both with near-optimal guarantees. The regret bound $\tilde{O}( H \sqrt{H S N K} )$ and the identification sample complexity $\tilde{O}( H^3 S N / \varepsilon^2 )$ scale polynomially in the problem parameters and are independent of the action-space size $|\mathcal{A}|$. Empirical results on MovieLens and synthetic data show substantially improved computational and sample efficiency over naive RL adaptations, highlighting the practicality of stateful, sequential recommendation in real-world settings.
Abstract
Cascading bandits have gained popularity in recent years due to their applicability to recommendation systems and online advertising. In the cascading bandit model, at each timestep, an agent recommends an ordered subset of items (called an item list) from a pool of items, each associated with an unknown attraction probability. The user then examines the list and clicks the first attractive item (if any), after which the agent receives a reward. The goal of the agent is to maximize the expected cumulative reward. However, the prior literature on cascading bandits ignores the influence of user states (e.g., historical behaviors) on recommendations and the change of states as the session proceeds. Motivated by this gap, we propose a generalized cascading RL framework, which incorporates user states and state transitions into decision-making. In cascading RL, we need to select items that not only have large attraction probabilities but also lead to good successor states. This poses a significant computational challenge due to the combinatorial action space. To tackle this challenge, we exploit the properties of value functions and design an oracle, BestPerm, to efficiently find the optimal item list. Equipped with BestPerm, we develop two algorithms, CascadingVI and CascadingBPI, which are both computationally efficient and sample-efficient, and provide near-optimal regret and sample complexity guarantees. Furthermore, we present experiments showing the improved computational and sample efficiency of our algorithms in practice, compared to straightforward adaptations of existing RL algorithms.
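The cascade click model described in the abstract can be illustrated with a minimal Python sketch. It covers only the stateless bandit interaction (the user scans the list in order and clicks the first attractive item); it is not the paper's BestPerm oracle or either algorithm. The function names, the item identifiers, and the attraction probabilities are hypothetical and chosen purely for illustration.

```python
import random


def simulate_cascade(item_list, attraction_prob, rng):
    """Simulate one user pass over an ordered item list.

    Returns the position of the first attractive item, or None if the
    user finds no item attractive. `attraction_prob` maps each item to
    its (hypothetical) attraction probability.
    """
    for position, item in enumerate(item_list):
        if rng.random() < attraction_prob[item]:
            return position
    return None


def click_probabilities(item_list, attraction_prob):
    """Exact click probability for each position under the cascade model.

    Position i is clicked only if every earlier item was unattractive
    and item i is attractive. Also returns the probability of no click.
    """
    probs = []
    no_click_so_far = 1.0
    for item in item_list:
        q = attraction_prob[item]
        probs.append(no_click_so_far * q)
        no_click_so_far *= 1.0 - q
    return probs, no_click_so_far


if __name__ == "__main__":
    # Illustrative values only; not taken from the paper.
    attraction = {"a": 0.5, "b": 0.5, "c": 0.25}
    rng = random.Random(0)
    print(click_probabilities(["a", "b", "c"], attraction))
    print(simulate_cascade(["a", "b", "c"], attraction, rng))
```

Because an item can only be clicked when all items ranked above it fail to attract the user, the order of the list matters. In the stateful setting the abstract describes, the order additionally affects which successor state is reached.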
