Long-Term Value of Exploration: Measurements, Findings and Algorithms
Yi Su, Xiangyu Wang, Elaine Ya Le, Liang Liu, Yuening Li, Haokai Lu, Benjamin Lipshitz, Sriraj Badam, Lukasz Heldt, Shuchao Bi, Ed Chi, Cristos Goodrow, Su-Lin Wu, Lexi Baugher, Minmin Chen
TL;DR
This work tackles the challenge of quantifying the long-term value of exploration in production recommender systems by introducing the Discoverable Corpus metric and a user-corpus-codiverted A/B framework that links corpus growth to long-term user satisfaction. It then implements Neural Linear Bandits as a scalable exploration backbone within a large-scale, multi-stage ranking system and validates the approach through extensive live experiments. Key findings show that exploration enlarges the discoverable corpus, improves the usefulness of tail and fresh content, and yields sustained gains in user satisfaction, with uncertainty estimates aligning with content and user characteristics. The study offers practical guidance for deploying exploration in industrial systems and points to future work on multi-task exploration and exploration-driven model learning.
Abstract
Effective exploration is believed to positively influence the long-term user experience on recommendation platforms. Determining its exact benefits, however, has been challenging. Regular A/B tests on exploration often measure neutral or even negative engagement metrics while failing to capture its long-term benefits. We introduce new experiment designs that formally quantify the long-term value of exploration by examining its effects on the content corpus and connecting content corpus growth to the long-term user experience in real-world experiments. Having established the value of exploration, we investigate the Neural Linear Bandit algorithm as a general framework for introducing exploration into any deep learning-based ranking system. We conduct live experiments on one of the largest short-form video recommendation platforms, which serves billions of users, to validate the new experiment designs, quantify the long-term value of exploration, and verify the effectiveness of the adopted Neural Linear Bandit algorithm for exploration.
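
As a rough illustration of the Neural Linear Bandit idea named above, the following minimal Python sketch keeps a Bayesian linear regression over embeddings that a ranking network would produce, and picks a candidate by Thompson sampling. It is not the production implementation described in this work: the class name, the hyperparameters `prior_precision` and `noise_var`, and the use of random vectors in place of learned last-layer embeddings are all assumptions made for illustration.

```python
import numpy as np


class NeuralLinearBanditSketch:
    """Hypothetical sketch: Bayesian linear regression on fixed embeddings,
    with Thompson sampling for candidate selection."""

    def __init__(self, dim, prior_precision=1.0, noise_var=1.0, seed=0):
        # Posterior precision A = prior_precision * I + sum(phi phi^T).
        self.precision = prior_precision * np.eye(dim)
        # Accumulated reward-weighted embeddings b = sum(reward * phi).
        self.b = np.zeros(dim)
        self.noise_var = noise_var
        self.rng = np.random.default_rng(seed)

    def posterior(self):
        """Return the posterior mean and covariance of the linear weights."""
        inv_precision = np.linalg.inv(self.precision)
        mean = inv_precision @ self.b
        cov = self.noise_var * inv_precision
        return mean, cov

    def select(self, features):
        """Pick one row of `features` (n_candidates x dim) by Thompson sampling."""
        mean, cov = self.posterior()
        theta = self.rng.multivariate_normal(mean, cov)  # one posterior draw
        return int(np.argmax(features @ theta))

    def update(self, phi, reward):
        """Fold one observed (embedding, reward) pair into the posterior."""
        self.precision += np.outer(phi, phi)
        self.b += reward * phi


# Usage: random vectors stand in for last-layer embeddings of 5 candidates.
bandit = NeuralLinearBanditSketch(dim=4)
candidates = np.random.default_rng(1).normal(size=(5, 4))
chosen = bandit.select(candidates)
bandit.update(candidates[chosen], reward=1.0)
```

Candidates with few observations have wide posteriors, so a sampled weight vector occasionally ranks them first. That is how this family of methods injects exploration without a separate heuristic.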
