Optimizing Audio Recommendations for the Long-Term: A Reinforcement Learning Perspective

Lucas Maystre; Daniel Russo; Yu Zhao

TL;DR

The paper presents a novel podcast recommender system, deployed at industrial scale, that optimizes personal listening journeys unfolding over months for hundreds of millions of listeners. It also formulates a comprehensive model of users' recurring relationships with a recommender system.

Abstract

We present a novel podcast recommender system deployed at industrial scale. This system successfully optimizes personal listening journeys that unfold over months for hundreds of millions of listeners. In deviating from the pervasive industry practice of optimizing machine learning algorithms for short-term proxy metrics, the system substantially improves long-term performance in A/B tests. The paper offers insights into how our methods cope with attribution, coordination, and measurement challenges that usually hinder such long-term optimization. To contextualize these practical insights within a broader academic framework, we turn to reinforcement learning (RL). Using the language of RL, we formulate a comprehensive model of users' recurring relationships with a recommender system. Then, within this model, we identify our approach as a policy improvement update to a component of the existing recommender system, enhanced by tailored modeling of value functions and user-state representations. Illustrative offline experiments suggest this specialized modeling reduces data requirements by as much as a factor of 120,000 compared to black-box approaches.

Paper Structure (66 sections, 4 theorems, 61 equations, 10 figures, 2 tables, 1 algorithm)

Introduction
Broader insights into the challenges solutions must overcome
Outline
Related literature
Recommender systems.
Surrogate outcomes and proxy-metrics.
RL for optimizing recommendation systems.
Other MDP models of recurring customer interactions.
An RL Model of our objective: using historical data to improve a component of a recommendation policy
A specific recommendation task to motivate abstract modeling.
Discrete time-period modeling of sequences of recommendations, engagement, and rewards.
Theoretical generative model of user behavior.
Recommendation policies and user lifetime reward.
The incumbent policy and logged data.
Improving a component of the policy.
...and 51 more sections

Key Result

Corollary 1

Under reward functions satisfying Eq. \ref{eq:separable-rewards} and Assumptions \ref{assumption:direct-short-term-impact}-\ref{assumption:bonus2}, if $S_t \neq \varnothing$ then for each $a \in \mathbb A_\star$, where $Z_{t+1,a}^{+} = f(Z_{t,a},1)$ and $Z_{t+1,a}^{-} = f(Z_{t,a},0)$ denote the successor content-relationship states that follow a listen and a no-listen, respectively, and $b_{\pi^0}(S_t)$ does not depend on th
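The successor-state notation $Z_{t+1,a}^{+} = f(Z_{t,a},1)$ and $Z_{t+1,a}^{-} = f(Z_{t,a},0)$ can be made concrete with a minimal sketch, assuming the exponential-moving-average form of $f$ that Example 1 names. The function name `ema_successor` and the decay rate `alpha` are illustrative assumptions, not names or values taken from the paper.

```python
def ema_successor(z: float, listened: int, alpha: float = 0.5) -> float:
    """Hypothetical transition f(z, y): an exponential moving average of
    listen indicators. `alpha` is an assumed decay rate, not a paper value."""
    if listened not in (0, 1):
        raise ValueError("listened must be 0 (no listen) or 1 (listen)")
    # Blend the previous relationship state with the newest listen indicator.
    return (1.0 - alpha) * z + alpha * listened


# Current content-relationship state for some item a (assumed scalar).
z_t = 0.5

# Successor states after a listen and after a no-listen, respectively.
z_plus = ema_successor(z_t, 1)
z_minus = ema_successor(z_t, 0)

print(z_plus, z_minus)
```

Under this assumed form, a listen moves the state toward 1 and a no-listen moves it toward 0, so the two successor states bracket the current one.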

Figures (10)

Figure 1: Explaining RL models of personalization, by contrast with contextual bandit models: In both types of models, the goal is to learn from interacting with users how to optimize interactions with future users. Contextual bandit algorithms learn to optimize the immediate reward accrued from a recommendation action (Li et al., 2010). RL models aim to optimize a sequence of interactions with an individual user, acknowledging that recommendation decisions in one period can impact the efficacy of recommendations in future periods.
Figure 2: A depiction of a possible trajectory of user interactions over 60 days. Highlighted in green on the first day is an interaction when the user searches for the term 'podcast' and receives personalized recommendations through which they discover Podcast X, a hypothetical podcast show that releases new episodes on a regular cadence. Subsequent recommendations resurface the show, and sixty days later the user has formed a deep connection and is still listening to Podcast X.
Figure 3: Screenshots of the Spotify mobile application. The banner component displays a single content item, whereas the podcast discovery shelf contains up to 20 cards that the user can scroll through.
Figure 4: Graphical model encoding Assumptions \ref{assumption:direct-short-term-impact} and \ref{assumption:markov-in-time}.
Figure 5: Nearly all discoveries from podcast recommendations can be causally credited to the recommender system.
...and 5 more figures
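
The contrast drawn in the Figure 1 caption, immediate reward versus reward accumulated over a sequence of interactions, can be illustrated with a toy sketch. The reward sequences, the discount factor `gamma`, and the function names are made-up assumptions for illustration; they are not data or notation from the paper.

```python
from typing import Dict, List


def immediate_reward(rewards: List[float]) -> float:
    """Bandit-style objective: only the first-period reward counts."""
    return rewards[0]


def discounted_return(rewards: List[float], gamma: float) -> float:
    """RL-style objective: discounted sum of rewards over all periods."""
    total = 0.0
    # Accumulate backwards so each step applies one more factor of gamma.
    for r in reversed(rewards):
        total = r + gamma * total
    return total


# Hypothetical per-period rewards following two candidate recommendations.
trajectories: Dict[str, List[float]] = {
    "A": [1.0, 0.0, 0.0],  # strong first click, no lasting habit
    "B": [0.5, 1.0, 1.0],  # weaker first click, recurring listening
}
gamma = 0.5  # assumed discount factor

bandit_choice = max(trajectories, key=lambda a: immediate_reward(trajectories[a]))
rl_choice = max(trajectories, key=lambda a: discounted_return(trajectories[a], gamma))

print(bandit_choice, rl_choice)
```

In this toy setting the two objectives disagree: the immediate-reward criterion prefers A, while the discounted return prefers B because its later rewards outweigh the weaker first period.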

Theorems & Definitions (12)

Remark 1: Implicit discounting
Remark 2: Causal interpretation of conditional expectations
Example 1: Exponential-moving-average relationship states
Remark 3: Comment on stickiness estimation
Corollary 1
Theorem 1
Remark 4: New representations, rather than fine-tuning
proof
Definition 1
Lemma 1
...and 2 more
