Pareto optimal proxy metrics
Alessandro Zito, Dylan Greaves, Jacopo Soriano, Lee Richardson
TL;DR
The paper addresses the practical problem of evaluating experiments when the north star metric is insensitive, or when its short-term and long-term movements diverge. It introduces Pareto optimal proxy metrics, a multi-objective framework that jointly optimizes short-term sensitivity and alignment with long-term outcomes by learning a weighted linear combination of auxiliary metrics. The authors propose two simple algorithms to extract the Pareto front and report proxies that are roughly eight times more sensitive than the north star while consistently moving in the same direction, validated on holdout data from a large-scale industrial recommender system. The work gives practical guidance for deploying proxy metrics, discusses broader benefits and limitations, and outlines future directions, including causality considerations, sparsity, and non-linear proxy constructions.
Abstract
North star metrics and online experimentation play a central role in how technology companies improve their products. In many practical settings, however, evaluating experiments based on the north star metric directly can be difficult. The two most significant issues are 1) low sensitivity of the north star metric and 2) differences between the short-term and long-term impact on the north star metric. A common solution is to rely on proxy metrics rather than the north star in experiment evaluation and launch decisions. Existing literature on proxy metrics concentrates mainly on the estimation of the long-term impact from short-term experimental data. In this paper, instead, we focus on the trade-off between the estimation of the long-term impact and the sensitivity in the short term. In particular, we propose the Pareto optimal proxy metrics method, which simultaneously optimizes prediction accuracy and sensitivity. In addition, we give an efficient multi-objective optimization algorithm that outperforms standard methods. We applied our methodology to experiments from a large industrial recommendation system and found proxy metrics that were eight times more sensitive than the north star and consistently moved in the same direction, increasing the velocity and the quality of the decisions to launch new features.
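To make the notion of a Pareto front over proxy metrics concrete, the following is a minimal sketch and not the authors' algorithm. It assumes each candidate proxy is a weight vector over auxiliary metrics and that some user-supplied function returns a pair of scores (sensitivity, alignment with the north star), both to be maximized. The names `pareto_front`, `random_simplex_weights`, `random_search_front`, and `score_fn` are hypothetical, and plain random search stands in for whatever optimizer is actually used.

```python
import random


def pareto_front(scores):
    """Return indices of non-dominated (sensitivity, alignment) pairs.

    Both objectives are maximized. A point is dominated if another point is
    at least as good on both objectives and strictly better on at least one.
    """
    front = []
    for i, (s_i, a_i) in enumerate(scores):
        dominated = any(
            s_j >= s_i and a_j >= a_i and (s_j > s_i or a_j > a_i)
            for j, (s_j, a_j) in enumerate(scores)
            if j != i
        )
        if not dominated:
            front.append(i)
    return front


def random_simplex_weights(n_metrics, rng):
    """Draw nonnegative weights that sum to 1 (normalized exponential draws)."""
    draws = [rng.expovariate(1.0) for _ in range(n_metrics)]
    total = sum(draws)
    return [d / total for d in draws]


def random_search_front(score_fn, n_metrics, n_draws=200, seed=0):
    """Assumed baseline: sample weight vectors, score them, keep the front.

    score_fn is a hypothetical callable mapping a weight vector to a
    (sensitivity, alignment) tuple; how those scores are estimated from
    experiment data is outside the scope of this sketch.
    """
    rng = random.Random(seed)
    candidates = [random_simplex_weights(n_metrics, rng) for _ in range(n_draws)]
    scores = [score_fn(w) for w in candidates]
    return [(candidates[i], scores[i]) for i in pareto_front(scores)]


# Toy scores for five candidate proxies: (sensitivity, alignment).
toy_scores = [(0.9, 0.2), (0.6, 0.6), (0.5, 0.5), (0.2, 0.9), (0.6, 0.4)]
print(pareto_front(toy_scores))  # -> [0, 1, 3]
```

In the toy data, candidate 2 is dominated by candidate 1 on both objectives, and candidate 4 ties candidate 1 on sensitivity but has lower alignment, so only candidates 0, 1, and 3 remain on the front. Choosing a single proxy from that front is a separate decision that depends on how much sensitivity one is willing to trade for alignment.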
