MultiScale Contextual Bandits for Long Term Objectives

Richa Rastogi, Yuta Saito, Thorsten Joachims

TL;DR

This work tackles the challenge of aligning short-term feedback with long-term objectives in interactive AI systems by introducing MultiScale Policy Learning (MSPL), a hierarchical framework that operates across multiple timescales. It formulates a two-level (and extendable multi-level) contextual bandit setting in which fast micro-level data informs a slower macro-level policy through data-driven priors derived via a PAC-Bayes perspective. The paper proposes MSBL, a practical algorithm that learns a family of micro policies and a macro policy, enabling efficient optimization of long-term objectives in tasks spanning recommender systems and conversational agents. Empirically, MSBL demonstrates improved long-term rewards and robustness across several simulated and real-data scenarios, highlighting its potential to improve long-horizon outcomes in interactive AI systems.

Abstract

The feedback that AI systems (e.g., recommender systems, chatbots) collect from user interactions is a crucial source of training data. While short-term feedback (e.g., clicks, engagement) is widely used for training, there is ample evidence that optimizing short-term feedback does not necessarily achieve the desired long-term objectives. Unfortunately, directly optimizing for long-term objectives is challenging, and we identify the disconnect in the timescales of short-term interventions (e.g., rankings) and the long-term feedback (e.g., user retention) as one of the key obstacles. To overcome this disconnect, we introduce the framework of MultiScale Policy Learning to contextually reconcile that AI systems need to act and optimize feedback at multiple interdependent timescales. Following a PAC-Bayes motivation, we show how the lower timescales with more plentiful data can provide a data-dependent hierarchical prior for faster learning at higher scales, where data is more scarce. As a result, the policies at all levels effectively optimize for the long-term. We instantiate the framework with MultiScale Off-Policy Bandit Learning (MSBL) and demonstrate its effectiveness on three tasks relating to recommender and conversational systems.

Paper Structure

This paper contains 28 sections, 25 equations, 14 figures, 1 table, 3 algorithms.

Figures (14)

  • Figure 1: MultiScale feedback $r$ with corresponding interventions $a$ at each level ($L1, L2, L3$). At the short-term level, engagement feedback (e.g., responses, clicks) is observed at the fastest timescale. At the next higher level, we observe feedback like the weekly return rate. And at an even higher level, subscription renewal is observed at the slowest timescale.
  • Figure 2: (a) At inference, a macro action acts as an index that selects a particular micro policy from a family of micro policies. The macro action space $\mathcal{A}^{L2}$ is isomorphic to the family of policies $\hat{\Pi}^{L1}$. (b) Learning micro policies: abundant micro-level data is used to learn promising policies $\hat{\Pi}^{L1}$ using policy or feedback modification. (c) Macro-level data is used to learn a macro policy. For more than two levels, (b) and (c) are called recursively, narrowing down the micro policy space / macro action space.
  • Figure 3: Multi-turn conversation: (a) Setup for learning preference weights $a^{L2}$ using feedback modification. (b) Comparison of long-term (multi-turn user satisfaction) vs. short-term (single-turn) rewards for all users across 5 random seeds.
  • Figure 4: Conversational recommender system: (a) Tradeoff between longer-term Level 2 and short-term Level 1 rewards using decoding temperature $a^{L2} \in \{0.0, 0.2, 0.4, 0.6, 0.8, 1.0\}$ as policy modification. (b) Tradeoff between expected rewards at all three levels. Expected rewards are reported across 5 random seeds for all users.
  • Figure 5: Recommender system: Tradeoff between long-term return rate and clicks by varying groups using the boost $a^{L2}$ to the ranking as policy modification. Expected short-term and long-term rewards are reported across 5 random seeds.
  • ...and 9 more figures
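
The two-level mechanism in the Figure 2 caption (a macro action selects one micro policy from a family, and a macro policy is learned from scarcer macro-level data) can be illustrated with a minimal Python sketch. It is not the paper's MSBL algorithm. The temperature grid is taken from the Figure 4 caption; everything else is an assumption made for illustration: the function names `micro_policy`, `learn_macro_policy`, and `act`, the softmax form of the micro policy, the string-valued macro contexts, and the naive per-context averaging that stands in for a proper off-policy estimator.

```python
import math
import random
from collections import defaultdict

# Macro action space: decoding temperatures, as listed in the Figure 4 caption.
MACRO_ACTIONS = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]


def micro_policy(scores, temperature, rng):
    """Hypothetical micro policy: sample an item index from short-term scores.

    The macro action (temperature) indexes one member of the policy family.
    """
    if temperature == 0.0:
        # Greedy limit of the softmax: always pick the top-scoring item.
        return max(range(len(scores)), key=lambda i: scores[i])
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp((s - top) / temperature) for s in scores]
    return rng.choices(range(len(scores)), weights=weights, k=1)[0]


def learn_macro_policy(macro_log):
    """Learn a macro policy from (macro_context, macro_action, long_term_reward).

    Assumption: a naive average of logged long-term rewards per
    (context, action) pair, ignoring logging propensities.
    """
    totals = defaultdict(lambda: [0.0, 0])
    for context, action, reward in macro_log:
        entry = totals[(context, action)]
        entry[0] += reward
        entry[1] += 1
    best = {}
    for (context, action), (total, count) in totals.items():
        mean = total / count
        if context not in best or mean > best[context][1]:
            best[context] = (action, mean)
    return {context: action for context, (action, _) in best.items()}


def act(macro_policy, macro_context, scores, rng):
    """At inference, the macro action chosen for the context selects the micro policy."""
    temperature = macro_policy[macro_context]
    return micro_policy(scores, temperature, rng)


# Toy macro-level log with made-up values, for illustration only.
macro_log = [
    ("casual", 0.0, 0.2),
    ("casual", 0.8, 0.6),
    ("casual", 0.8, 0.8),
    ("power", 0.0, 0.9),
    ("power", 0.8, 0.3),
]
macro_policy = learn_macro_policy(macro_log)
chosen_item = act(macro_policy, "power", [0.1, 0.9, 0.3], random.Random(0))
```

In this toy log the learned macro policy assigns temperature 0.8 to the "casual" context and 0.0 to the "power" context, so `act` behaves greedily for "power" users and samples more broadly for "casual" users. A faithful implementation would replace the naive average with an off-policy estimator that accounts for how the logged macro actions were chosen.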