Pareto optimal proxy metrics

Alessandro Zito, Dylan Greaves, Jacopo Soriano, Lee Richardson

TL;DR

The paper addresses the practical problem of evaluating experiments when the north star metric is either insensitive or moves differently in the short and long term. It introduces Pareto optimal proxy metrics, a multi-objective framework that jointly optimizes short-term sensitivity and alignment with the long-term north star by learning a weighted linear combination of auxiliary metrics. The authors propose two algorithms to extract the Pareto front and report proxies that are roughly eight times more sensitive than the north star while consistently moving in the same direction, based on experiments from a large industrial recommendation system. The work also gives practical guidance for deploying proxy metrics, discusses benefits and limitations, and outlines future directions including causality considerations, sparsity, and non-linear proxy constructions.

Abstract

North star metrics and online experimentation play a central role in how technology companies improve their products. In many practical settings, however, evaluating experiments based on the north star metric directly can be difficult. The two most significant issues are 1) low sensitivity of the north star metric and 2) differences between the short-term and long-term impact on the north star metric. A common solution is to rely on proxy metrics rather than the north star in experiment evaluation and launch decisions. Existing literature on proxy metrics concentrates mainly on the estimation of the long-term impact from short-term experimental data. In this paper, instead, we focus on the trade-off between the estimation of the long-term impact and the sensitivity in the short term. In particular, we propose the Pareto optimal proxy metrics method, which simultaneously optimizes prediction accuracy and sensitivity. In addition, we give an efficient multi-objective optimization algorithm that outperforms standard methods. We applied our methodology to experiments from a large industrial recommendation system, and found proxy metrics that are eight times more sensitive than the north star and consistently moved in the same direction, increasing the velocity and the quality of the decisions to launch new features.

Paper Structure (17 sections, 13 equations, 8 figures, 1 table, 2 algorithms)

Figures (8)

  • Figure 1: A simulated example of two cases where a proxy metric is useful. The left panel shows the case where the effect on the north star metric is positive but too small relative to the noise to measure accurately. This may happen, for instance, when the north star represents a large positive user action that happens infrequently. The right panel shows the case where the north star metric differs significantly between the short and long term, and the proxy metric reflects the long-term impact early in the experiment.
  • Figure 2: The relationship between correlation and sensitivity for 70 auxiliary metrics across over 300 experiments. Each metric is either a gray or a black dot. The black dots highlight several auxiliary metrics that trade off sensitivity against correlation. Notably, the short-term value of the north star, in the bottom right, is the least sensitive metric but the most correlated with the long-term impact on the north star.
  • Figure 3: An example of the Pareto front in the proxy metric problem. Each gray dot represents evaluations of the objective in a randomized search. The red dots are points on the Pareto front. The green dot is a point that is Pareto dominated, and the gray shaded area shows where the green dot is Pareto dominated.
  • Figure 4: Pareto front extracted under the three methods for increasing number of auxiliary metrics.
  • Figure 5: Left: area under the Pareto curve for each algorithm. Right: running time in seconds to extract the Pareto front in Figure 4 for each algorithm.
  • ...and 3 more figures
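
The Pareto dominance idea described in the Figure 3 caption can be illustrated with a minimal sketch. It assumes each candidate proxy has already been scored on two objectives to be maximized, sensitivity and correlation with the long-term north star. The function names and the candidate scores are hypothetical, chosen only for illustration; this is a brute-force filter, not either of the paper's two algorithms.

```python
from typing import List, Tuple

# Each point is (sensitivity, correlation); both objectives are maximized.
Point = Tuple[float, float]


def dominates(a: Point, b: Point) -> bool:
    """True if a is at least as good as b on both objectives and strictly better on one."""
    return a[0] >= b[0] and a[1] >= b[1] and (a[0] > b[0] or a[1] > b[1])


def pareto_front(points: List[Point]) -> List[Point]:
    """Return the non-dominated points, sorted by increasing sensitivity."""
    front = [p for p in points if not any(dominates(q, p) for q in points)]
    return sorted(set(front))


# Hypothetical (sensitivity, correlation) evaluations, e.g. from a randomized search.
candidates = [
    (0.10, 0.90),
    (0.30, 0.80),
    (0.25, 0.60),  # dominated by (0.30, 0.80)
    (0.50, 0.50),
    (0.40, 0.40),  # dominated by (0.50, 0.50)
    (0.70, 0.20),
]

front = pareto_front(candidates)
print(front)
```

The pairwise check is quadratic in the number of candidates, which is fine for a small illustration. Every point that survives the filter represents a different trade-off between the two objectives, so choosing among them remains a separate decision.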