COPA: Comparing the incomparable in multi-objective model evaluation
Adrián Javaloy, Antonio Vergari, Isabel Valera
TL;DR
COPA tackles the challenge of comparing multi-objective ML models when objectives are heterogeneous and incommensurable. It introduces a universal normalization based on the cumulative distribution function (CDF) to map each objective to a common $[0,1]$ scale via the probability integral transform, and then aggregates the normalized objectives via a parametric weighted $p$-norm, controlled by $p$ and $\boldsymbol{\omega}$, to reflect user preferences. The approach yields Pareto-optimal selections and robust frontier exploration, demonstrated across model selection, benchmarking, fairness-accuracy trade-offs, and LLM cost-performance analyses. By leveraging rank-based approximations and CDF-transformed criteria, COPA provides a principled, adaptable framework for navigating the Pareto front in diverse ML evaluation scenarios, reducing ad hoc normalization biases. The work highlights practical benefits for AutoML benchmarking, domain generalization, and AI model marketplaces, while noting limitations related to sampling, correlations, and the need for richer preference modeling in future work.
Abstract
In machine learning (ML), we often need to choose one among hundreds of trained ML models at hand, based on various objectives such as accuracy, robustness, fairness or scalability. However, it is often unclear how to compare, aggregate and, ultimately, trade off these objectives, which makes model selection a time-consuming task that requires expert knowledge, as objectives may be measured in different units and scales. In this work, we investigate how objectives can be automatically normalized and aggregated to systematically help the user navigate their Pareto front. To this end, we make incomparable objectives comparable using their cumulative distribution functions (CDFs), approximated by their relative rankings. As a result, our proposed approach, COPA, can aggregate them while matching user-specific preferences, allowing practitioners to meaningfully navigate and search for models in the Pareto front. We demonstrate the potential impact of COPA in both model selection and benchmarking tasks across diverse ML areas such as fair ML, domain generalization, AutoML and foundation models, where classical ways to normalize and aggregate objectives fall short.
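
To make the two steps described above concrete, the following is a minimal sketch of rank-based normalization followed by a weighted $p$-norm aggregation. It is an illustration under stated assumptions, not the authors' implementation: the function names `rank_cdf` and `weighted_p_norm`, the toy numbers, the convention that lower values are better, the tie-breaking by position, and the exact placement of the weights inside the norm are all hypothetical choices made for this example.

```python
import numpy as np


def rank_cdf(values):
    """Approximate each objective's CDF by relative ranks.

    values: array of shape (n_models, n_objectives), lower is better (assumed).
    Returns an array of the same shape with entries in (0, 1].
    """
    values = np.asarray(values, dtype=float)
    n_models = values.shape[0]
    # A double argsort yields 0-based ranks per column; ties are broken by
    # position, which is a simplification made for this sketch.
    ranks = values.argsort(axis=0).argsort(axis=0) + 1
    return ranks / n_models


def weighted_p_norm(u, weights, p):
    """Aggregate normalized objectives into one score per model (lower is better).

    Assumed form: (sum_k (w_k * u_k)^p)^(1/p), with weights normalized to sum to 1.
    """
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    scaled = w * u
    if np.isinf(p):
        return scaled.max(axis=1)
    return (scaled ** p).sum(axis=1) ** (1.0 / p)


# Toy data (made up): three models, objectives are (error rate, latency in ms).
objectives = np.array([
    [0.10, 500.0],  # model 0: most accurate, slowest
    [0.20, 50.0],   # model 1: least accurate, fastest
    [0.15, 200.0],  # model 2: in between on both
])

u = rank_cdf(objectives)  # both columns now share the same (0, 1] scale

# Equal weights with p = infinity penalize the worst objective of each model.
balanced = int(np.argmin(weighted_p_norm(u, [0.5, 0.5], np.inf)))

# Heavy weight on error with p = 1 favors the most accurate model.
accuracy_first = int(np.argmin(weighted_p_norm(u, [0.9, 0.1], 1)))

print(balanced, accuracy_first)
```

In this toy setting all three models lie on the Pareto front, and changing $p$ and $\boldsymbol{\omega}$ moves the selection along it: the balanced setting picks model 2, while the accuracy-weighted setting picks model 0. Note that ranks discard the original units entirely, which is what lets an error rate and a latency in milliseconds be combined.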
