Towards Multi-Stakeholder Evaluation of ML Models: A Crowdsourcing Study on Metric Preferences in Job-matching System
Takuya Yokota, Yuri Nakao
TL;DR
The paper tackles the lack of a universal metric for evaluating ML outputs across diverse stakeholders by introducing a crowdsourcing workflow in a hypothetical job-matching context and a seven-metric utility framework where $U_{ij}=V_{ij}+\varepsilon_{ij}$ and $V_{ij}$ depends on seven metrics. It applies K-means clustering and lift-based association analysis to identify five metric-preference clusters among 837 participants and links these clusters to demographic attributes, revealing dominant performance-focused clusters and minority fairness-focused clusters. The findings offer practical guidance for multi-stakeholder evaluation, emphasizing the need to ensure minority representation and context-aware metric selection when assessing ML systems. Overall, the work provides a framework to model stakeholder groups by metric preferences and demographics, informing more equitable and transparent evaluation practices in real-world decision-support settings.
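To make the utility framework concrete, one common way to write such a model is sketched below. Everything beyond the decomposition of $U_{ij}$ into a systematic part $V_{ij}$ and an error term is an assumption made for illustration: that $i$ indexes participants and $j$ indexes candidate models, that $V_{ij}$ is linear in the seven metric values $x_{jk}$ with participant-specific weights $\beta_{ik}$, and that the error terms are i.i.d. Gumbel-distributed, which yields the familiar logit form for a choice between two models $j$ and $j'$.

$$
U_{ij} = V_{ij} + \varepsilon_{ij}, \qquad V_{ij} = \sum_{k=1}^{7} \beta_{ik}\, x_{jk}
$$

$$
P(i \text{ chooses } j \text{ over } j') = \frac{\exp(V_{ij})}{\exp(V_{ij}) + \exp(V_{ij'})}
$$

Under these assumptions, the weights $\beta_{ik}$ would play the role of a participant's utility values for the seven metrics.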
Abstract
While machine learning (ML) technology affects diverse stakeholders, there is no one-size-fits-all metric to evaluate the quality of outputs, including performance and fairness. Using predetermined metrics without soliciting stakeholder opinions is problematic because it unfairly disregards stakeholders in the ML pipeline. In this study, to establish practical ways to incorporate diverse stakeholder opinions into the selection of metrics for ML, we investigate participants' preferences for different metrics by using crowdsourcing. We ask 837 participants to choose, twenty times, the better of two hypothetical ML models in a hypothetical job-matching system, and we calculate their utility values for seven metrics. To examine the participants' feedback in detail, we divide them into five clusters based on their utility values and analyze the tendencies of each cluster, including their preferences for metrics and common attributes. Based on the results, we discuss the points that should be considered when selecting appropriate metrics and evaluating ML models with multiple stakeholders.
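The clustering and attribute analysis summarized above can be illustrated with a minimal sketch. It is not the authors' implementation: the data are synthetic, the toy example uses two clusters rather than five, the helper names (`kmeans`, `lift`, `has_attr`) are hypothetical, and the deterministic farthest-first initialization is an assumption chosen to keep the example reproducible. In practice a library such as scikit-learn would typically be used for K-means.

```python
import random


def dist2(p, q):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, n_iter=50):
    """Lloyd's algorithm with farthest-first initialization (an assumption)."""
    centroids = [list(points[0])]
    while len(centroids) < k:
        # Next centroid: the point farthest from all centroids chosen so far.
        far = max(points, key=lambda p: min(dist2(p, c) for c in centroids))
        centroids.append(list(far))
    labels = None
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [min(range(k), key=lambda c: dist2(p, centroids[c]))
                      for p in points]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


def lift(in_cluster, has_attr):
    """Lift = P(cluster and attribute) / (P(cluster) * P(attribute))."""
    n = len(in_cluster)
    p_c = sum(in_cluster) / n
    p_a = sum(has_attr) / n
    p_both = sum(1 for c, a in zip(in_cluster, has_attr) if c and a) / n
    return p_both / (p_c * p_a) if p_c and p_a else float("nan")


# Synthetic 7-dimensional utility vectors for 60 hypothetical participants:
# the first 30 weight metric 0 most heavily, the last 30 weight metric 6.
rng = random.Random(0)
points = []
for i in range(60):
    center = [0.0] * 7
    center[0 if i < 30 else 6] = 1.0
    points.append([v + rng.gauss(0, 0.05) for v in center])

labels, centroids = kmeans(points, k=2)

# Hypothetical binary attribute held by 20 members of the first group
# and 5 members of the second group.
has_attr = [i < 20 or i >= 55 for i in range(60)]
in_first_cluster = [l == labels[0] for l in labels]
attr_lift = lift(in_first_cluster, has_attr)
print("lift of attribute in first cluster:", round(attr_lift, 3))
```

A lift above 1 indicates that the attribute is over-represented in the cluster relative to the whole sample, while a lift below 1 indicates under-representation.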
