COPA: Comparing the incomparable in multi-objective model evaluation

Adrián Javaloy, Antonio Vergari, Isabel Valera

TL;DR

COPA tackles the challenge of comparing multi-objective ML models when objectives are heterogeneous and incommensurable. It introduces a universal normalization based on the cumulative distribution function to map each objective to a common [0,1] scale via the probability integral transform, and then aggregates these via a parametric weighted p-norm, controlled by $p$ and $\boldsymbol{\omega}$, to reflect user preferences. The approach yields Pareto-optimal selections and robust frontier exploration, demonstrated across model selection, benchmarking, fairness-accuracy trade-offs, and LLM cost-performance analyses. By leveraging rank-based approximations and CDF-transformed criteria, COPA provides a principled, adaptable framework for navigating the Pareto front in diverse ML evaluation scenarios, reducing ad hoc normalization biases. The work highlights practical benefits for AutoML benchmarking, domain generalization, and AI model marketplaces, while noting limitations related to sampling, correlations, and the need for richer preference modeling in future work.
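The two steps summarized above (rank-based CDF normalization, then a weighted p-norm) can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: the function names, the convention that every objective is to be minimized, and the toy numbers are all assumptions made for the example.

```python
import numpy as np

def empirical_cdf(values):
    """Rank-based empirical CDF: for each entry, the fraction of entries <= it."""
    values = np.asarray(values, dtype=float)
    # Entry [i, j] is True when values[j] <= values[i]; averaging over j gives rank / N.
    return (values[None, :] <= values[:, None]).mean(axis=1)

def weighted_p_norm(u, weights, p):
    """Weighted p-norm of each row of u; p = inf gives the weighted maximum."""
    scaled = np.asarray(weights, dtype=float) * np.asarray(u, dtype=float)
    if np.isinf(p):
        return scaled.max(axis=-1)
    return (scaled ** p).sum(axis=-1) ** (1.0 / p)

def select_model(objectives, weights, p=np.inf):
    """Normalize each objective column to [0, 1], aggregate, return the best row index.

    Assumption for this sketch: every objective is to be minimized.
    """
    objectives = np.asarray(objectives, dtype=float)
    u = np.column_stack([empirical_cdf(col) for col in objectives.T])
    scores = weighted_p_norm(u, weights, p)
    return int(np.argmin(scores)), u, scores

# Hypothetical toy data: columns are (error rate, cost), on very different scales.
objectives = [[0.10, 500.0],
              [0.20, 50.0],
              [0.30, 5.0],
              [0.25, 400.0]]

balanced, u, _ = select_model(objectives, weights=[0.5, 0.5])
error_first, _, _ = select_model(objectives, weights=[0.9, 0.1])
print(balanced, error_first)
```

Because each column is replaced by its ranks, the cost column (spanning 5 to 500) and the error column (spanning 0.10 to 0.30) end up on the same [0, 1] scale before the weights are applied, which is what lets the weights act as preferences rather than as unit conversions.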

Abstract

In machine learning (ML), we often need to choose one among hundreds of trained ML models at hand, based on various objectives such as accuracy, robustness, fairness or scalability. However, it is often unclear how to compare, aggregate and, ultimately, trade off these objectives, making it a time-consuming task that requires expert knowledge, as objectives may be measured in different units and scales. In this work, we investigate how objectives can be automatically normalized and aggregated to systematically help the user navigate their Pareto front. To this end, we make incomparable objectives comparable using their cumulative functions, approximated by their relative rankings. As a result, our proposed approach, COPA, can aggregate them while matching user-specific preferences, allowing practitioners to meaningfully navigate and search for models in the Pareto front. We demonstrate the potential impact of COPA in both model selection and benchmarking tasks across diverse ML areas such as fair ML, domain generalization, AutoML and foundation models, where classical ways to normalize and aggregate objectives fall short.
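For readers unfamiliar with the term, the Pareto front mentioned above is the set of models that no other model beats on every objective at once. The following sketch computes it by brute force; the function name `pareto_mask`, the minimization convention, and the toy data are assumptions for illustration, not code from the paper.

```python
import numpy as np

def pareto_mask(objectives):
    """Boolean mask of non-dominated rows, assuming all objectives are minimized."""
    objectives = np.asarray(objectives, dtype=float)
    mask = np.ones(objectives.shape[0], dtype=bool)
    for i in range(objectives.shape[0]):
        # Row j dominates row i if it is no worse on every objective
        # and strictly better on at least one.
        no_worse = (objectives <= objectives[i]).all(axis=1)
        strictly_better = (objectives < objectives[i]).any(axis=1)
        if (no_worse & strictly_better).any():
            mask[i] = False
    return mask

# Hypothetical models described by (error rate, cost).
models = [[0.10, 500.0],
          [0.20, 50.0],
          [0.30, 5.0],
          [0.25, 400.0]]
print(pareto_mask(models))
```

Here the last model is dominated by the second one (higher error and higher cost), so it is excluded, while the other three represent genuine trade-offs that a user still has to choose among.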

Paper Structure

This paper contains 60 sections, 1 theorem, 17 equations, 19 figures, 2 tables.

Key Result

Proposition 1

With $y_i$ an observed criterion value, $F$ its CDF, and $N$ the number of samples, $\hat{u}_i$ is an unbiased estimator of the CDF at $y_i$, $u_i = F(y_i)$, with variance $u_i(1-u_i)/N$. Therefore, the variance of $\hat{u}_i$ decreases linearly with $N$, and ha...
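A short derivation shows where a variance of this form comes from. It is a sketch under the assumption that the estimate is an empirical CDF built from $N$ i.i.d. samples $Y_1, \dots, Y_N$ and evaluated at a fixed point $y$ with $u = F(y)$; the notation is chosen here for illustration, and the paper's own proof may be set up differently.

$$
\hat{F}_N(y) = \frac{1}{N} \sum_{j=1}^{N} \mathbb{1}[Y_j \le y],
\qquad \mathbb{1}[Y_j \le y] \sim \mathrm{Bernoulli}(u)
$$

$$
\mathbb{E}\big[\hat{F}_N(y)\big] = \frac{1}{N} \sum_{j=1}^{N} u = u,
\qquad
\mathrm{Var}\big[\hat{F}_N(y)\big] = \frac{1}{N^2} \sum_{j=1}^{N} u(1-u) = \frac{u(1-u)}{N}
$$

As a worked example, with $N = 100$ models the variance at the median ($u = 0.5$) is $0.25/100 = 0.0025$, a standard deviation of $0.05$, while at $u = 0.9$ it is $0.09/100 = 0.0009$, a standard deviation of $0.03$.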

Figures (19)

  • Figure 1: COPA meaningfully navigates the performance-cost trade-offs in the Open LLM Leaderboard, sensibly mapping the importance of CO2 cost to the Pareto front. In contrast, existing approaches such as AHP and SAW (see the appendix on baselines) are either biased toward one of the objectives or find few solutions (colored dots). This is reflected in the retrieved LLMs, where COPA maps $\alpha=1/2$ to a top-18% model for both objectives.
  • Figure 2: As we apply different normalization functions to the synthetic Pareto front from the synthetic experiment to solve the corresponding $\infty$-norm selection problem, only COPA meaningfully navigates it as we change $\alpha$.
  • Figure 3: Distribution of solutions (circles) found for different values of $p$ as we sweep over values of $\alpha$. The darkness of the circles represents the number of times they were selected by changing $\alpha$.
  • Figure 4: COPA helps us meaningfully explore the Pareto front of the Open LLM Leaderboard [open-llm-leaderboard-v2]. We use $p=\infty$, 7.0, and highlight some selected models as we change the value of $\alpha$.
  • Figure 5: COPA can be used to meaningfully explore accuracy-fairness trade-offs in the CelebA experiment from [maheshwari2022fairgrad] in unconstrained as well as user-constrained scenarios.
  • ...and 14 more figures

Theorems & Definitions (2)

  • proposition 1
  • proof