Navigating the Evaluation Funnel to Optimize Iteration Speed for Recommender Systems
Claire Schultzberg, Brammert Ottens
TL;DR
The paper addresses how to efficiently combine offline and online evaluation methods into a single evaluation funnel for recommender and personalized search systems, with the goal of accelerating product iteration. It introduces a framework that decomposes success into necessary and sufficient criteria, guiding early offline verification and validation to prune non-viable iterations before engaging in costly online experiments. A comprehensive survey of evaluation methods follows, covering counterfactual reconstruction, offline and online verification and validation, sequential testing, variance reduction, exposure filtering, interleaving, multi-armed bandits (MABs), and Bayesian optimization, with practical guidance on when and how to apply them. The work highlights the central role of counterfactual reconstruction in both offline and online contexts and argues for a disciplined, stepwise evaluation process that reduces wasted effort, shortens time-to-ship, and clarifies the mechanisms by which changes affect user experiences and business outcomes.
Abstract
Over the last decades, a rich literature has emerged on the evaluation of recommendation systems. However, less has been written about how to combine different evaluation methods from this field into a single efficient evaluation funnel. In this paper, we aim to build intuition for how to choose evaluation methods by presenting a novel framework that simplifies reasoning about the evaluation funnel for a recommendation system. Our contribution is twofold. First, we present our framework for decomposing the definition of success in order to construct efficient evaluation funnels, focusing on how to identify and discard unsuccessful iterations quickly. We show that decomposing the definition of success into smaller necessary criteria enables early identification of unsuccessful ideas. Second, we give an overview of the most common and useful evaluation methods, discuss their pros and cons, and describe how they fit into, and complement each other in, the evaluation process. We cover both offline and online evaluation methods, such as counterfactual logging, validation, verification, A/B testing, and interleaving. The paper concludes with a general discussion and advice on how to design an efficient evaluation process for recommender systems.
