Optimizing Audio Recommendations for the Long-Term: A Reinforcement Learning Perspective

Lucas Maystre, Daniel Russo, Yu Zhao

TL;DR

The paper presents a novel podcast recommender system, deployed at industrial scale, that optimizes personal listening journeys unfolding over months for hundreds of millions of listeners. It also formulates a comprehensive model of users' recurring relationships with a recommender system.

Abstract

We present a novel podcast recommender system deployed at industrial scale. This system successfully optimizes personal listening journeys that unfold over months for hundreds of millions of listeners. In deviating from the pervasive industry practice of optimizing machine learning algorithms for short-term proxy metrics, the system substantially improves long-term performance in A/B tests. The paper offers insights into how our methods cope with attribution, coordination, and measurement challenges that usually hinder such long-term optimization. To contextualize these practical insights within a broader academic framework, we turn to reinforcement learning (RL). Using the language of RL, we formulate a comprehensive model of users' recurring relationships with a recommender system. Then, within this model, we identify our approach as a policy improvement update to a component of the existing recommender system, enhanced by tailored modeling of value functions and user-state representations. Illustrative offline experiments suggest this specialized modeling reduces data requirements by as much as a factor of 120,000 compared to black-box approaches.

Paper Structure

This paper contains 66 sections, 4 theorems, 61 equations, 10 figures, 2 tables, 1 algorithm.

Key Result

Corollary 1

Under reward functions satisfying eq:separable-rewards, and Assumptions assumption:direct-short-term-impact-assumption:bonus2, if $S_t \neq \varnothing$ then for each $a\in \mathbb A_\star$, where $Z_{t+1,a}^{+} = f(Z_{t,a},1)$ and $Z_{t+1,a}^{-} = f(Z_{t,a},0)$ denote successor content-relationship-states that follow a listen and a no-listen, respectively, and $b_{\pi^0}(S_t)$ does not depend on th
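To make the successor states $Z_{t+1,a}^{+}$ and $Z_{t+1,a}^{-}$ concrete, here is a minimal sketch of one possible transition function $f$. It assumes a single exponential moving average of listen indicators, in the spirit of the exponential-moving-average relationship states listed as Example 1. The function names and the smoothing parameter `alpha` are hypothetical and are not taken from the paper, whose actual state representation may differ.

```python
def ema_update(z: float, listened: int, alpha: float = 0.5) -> float:
    """Hypothetical transition f(z, y): blend the previous state z with
    the new listen indicator y (1 = listen, 0 = no listen)."""
    if listened not in (0, 1):
        raise ValueError("listened must be 0 or 1")
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must lie in (0, 1]")
    return (1.0 - alpha) * z + alpha * listened


def successor_states(z: float, alpha: float = 0.5) -> tuple[float, float]:
    """Return (z_plus, z_minus): the states that follow a listen and a
    no-listen, i.e. f(z, 1) and f(z, 0) under the assumed EMA form."""
    return ema_update(z, 1, alpha), ema_update(z, 0, alpha)


# Example: a user with current relationship state 0.5 for some item.
z_plus, z_minus = successor_states(0.5, alpha=0.5)
print(z_plus, z_minus)  # 0.75 0.25
```

Under this assumed form, a listen moves the state toward 1 and a no-listen decays it toward 0, so the gap between the two successor states is exactly `alpha`.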

Figures (10)

  • Figure 1: Explaining RL models of personalization, by contrast with contextual bandit models: In both types of models, the goal is to learn from interacting with users how to optimize interactions with future users. Contextual bandit algorithms learn to optimize the immediate reward accrued from a recommendation action (Li et al., 2010). RL models aim to optimize a sequence of interactions with an individual user, acknowledging that recommendation decisions in one period can impact the efficacy of recommendations in future periods.
  • Figure 2: A depiction of a possible trajectory of user interactions over 60 days. Highlighted in green on the first day is an interaction when the user searches for the term 'podcast' and receives personalized recommendations through which they discover Podcast X, a hypothetical podcast show that releases new episodes on a regular cadence. Subsequent recommendations resurface the show, and sixty days later the user has formed a deep connection and is still listening to Podcast X.
  • Figure 3: Screenshots of Spotify mobile application. The banner component displays a single content item, whereas the podcast discovery shelf contains up to 20 cards that the user can scroll through.
  • Figure 4: Graphical model encoding Assumptions \ref{assumption:direct-short-term-impact} and \ref{assumption:markov-in-time}.
  • Figure 5: Nearly all discoveries from podcast recommendations can be causally credited to the recommender system.
  • ...and 5 more figures

Theorems & Definitions (12)

  • Remark 1: Implicit discounting
  • Remark 2: Causal interpretation of conditional expectations
  • Example 1: Exponential-moving-average relationship states
  • Remark 3: Comment on stickiness estimation
  • Corollary 1
  • Theorem 1
  • Remark 4: New representations, rather than fine-tuning
  • Proof
  • Definition 1
  • Lemma 1
  • ...and 2 more