The Quest for Reliable Metrics of Responsible AI

Theresia Veronika Rampisela, Maria Maistro, Tuukka Ruotsalo, Christina Lioma

TL;DR

This work critiques fairness metrics used in recommender systems, diagnosing issues such as undefined ranges, computation failures, and interpretability gaps that undermine metric reliability. It introduces metric corrections with min-max normalization and a novel joint metric that combines performance and fairness, accompanied by open-source code. The authors distill practical guidelines for designing reliable metrics and argue that these principles generalize to AI in Science (AIS), highlighting the need for policy-aligned evaluation. They advocate cross-disciplinary collaboration to audit and embed reliable metrics within high-stakes AIS deployments and governance.

Abstract

The development of Artificial Intelligence (AI), including AI in Science (AIS), should follow the principles of responsible AI. Progress in responsible AI is often quantified through evaluation metrics, yet less work has assessed the robustness and reliability of the metrics themselves. We reflect on prior work that examines the robustness of fairness metrics for recommender systems, as one type of AI application, and summarise its key takeaways into a set of non-exhaustive guidelines for developing reliable metrics of responsible AI. Our guidelines apply to a broad spectrum of AI applications, including AIS.

Paper Structure

This paper contains 9 sections.