Towards Multi-Stakeholder Evaluation of ML Models: A Crowdsourcing Study on Metric Preferences in Job-matching System

Takuya Yokota, Yuri Nakao

TL;DR

The paper tackles the lack of a universal metric for evaluating ML outputs across diverse stakeholders by introducing a crowdsourcing workflow in a hypothetical job-matching context and a seven-metric utility framework in which $U_{ij}=V_{ij}+\varepsilon_{ij}$ and $V_{ij}$ depends on seven metrics. It applies K-means clustering and lift-based association analysis to identify five metric-preference clusters among 837 participants and links these clusters to demographic attributes, revealing dominant performance-focused clusters and minority fairness-focused clusters. The findings offer practical guidance for multi-stakeholder evaluation, emphasizing the need to ensure minority representation and context-aware metric selection when assessing ML systems. Overall, the work provides a framework to model stakeholder groups by metric preferences and demographics, informing more equitable and transparent evaluation practices in real-world decision-support settings.
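The lift-based association analysis mentioned in the TL;DR can be illustrated with a minimal sketch. It assumes the standard definition of lift, P(attribute | cluster) / P(attribute); the record layout, the function name `lift`, and the attribute label `age<30` are hypothetical and the toy data is not taken from the study.

```python
def lift(records, cluster, attribute):
    """Lift of `attribute` in `cluster`: P(attribute | cluster) / P(attribute).

    `records` is a list of dicts with a "cluster" id and a set of
    "attributes" (a hypothetical layout chosen for this illustration).
    """
    in_cluster = [r for r in records if r["cluster"] == cluster]
    if not in_cluster:
        raise ValueError("no participants in the requested cluster")

    # Share of all participants who have the attribute.
    p_attr = sum(attribute in r["attributes"] for r in records) / len(records)
    if p_attr == 0:
        raise ValueError("attribute never occurs, so lift is undefined")

    # Share of the cluster's participants who have the attribute.
    p_attr_given_cluster = (
        sum(attribute in r["attributes"] for r in in_cluster) / len(in_cluster)
    )
    return p_attr_given_cluster / p_attr


# Toy data, invented for illustration only.
records = [
    {"cluster": 0, "attributes": {"age<30"}},
    {"cluster": 0, "attributes": {"age<30"}},
    {"cluster": 0, "attributes": set()},
    {"cluster": 1, "attributes": {"age<30"}},
    {"cluster": 1, "attributes": set()},
    {"cluster": 1, "attributes": set()},
]

lift_c0 = lift(records, 0, "age<30")  # (2/3) / (3/6)
lift_c1 = lift(records, 1, "age<30")  # (1/3) / (3/6)
```

A lift above 1 means the attribute is over-represented in the cluster relative to the whole sample, and a lift below 1 means it is under-represented.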

Abstract

While machine learning (ML) technology affects diverse stakeholders, there is no one-size-fits-all metric to evaluate the quality of outputs, including performance and fairness. Using predetermined metrics without soliciting stakeholder opinions is problematic because it leads to an unfair disregard for stakeholders in the ML pipeline. In this study, to establish practical ways to incorporate diverse stakeholder opinions into the selection of metrics for ML, we investigate participants' preferences for different metrics by using crowdsourcing. We ask 837 participants to choose a better model from two hypothetical ML models in a hypothetical job-matching system twenty times and calculate their utility values for seven metrics. To examine the participants' feedback in detail, we divide them into five clusters based on their utility values and analyze the tendencies of each cluster, including their preferences for metrics and common attributes. Based on the results, we discuss the points that should be considered when selecting appropriate metrics and evaluating ML models with multiple stakeholders.
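The pairwise choice task described above is commonly analyzed with a random-utility model. The sketch below is one possible reading, not the authors' confirmed procedure: it assumes the systematic utility is a linear combination of the seven metric values and that the error terms are i.i.d. Gumbel, which gives the binary logit choice probability. The function names, weights, and metric vectors are hypothetical.

```python
import math


def systematic_utility(weights, metrics):
    """Assumed linear form: V = sum_k weight_k * metric_k."""
    if len(weights) != len(metrics):
        raise ValueError("weights and metrics must have the same length")
    return sum(w * x for w, x in zip(weights, metrics))


def choice_probability(weights, metrics_a, metrics_b):
    """P(model A is chosen over model B) under a binary logit model.

    With i.i.d. Gumbel errors, the difference of the two error terms is
    logistic, so P(A) = 1 / (1 + exp(-(V_A - V_B))).
    """
    diff = systematic_utility(weights, metrics_a) - systematic_utility(
        weights, metrics_b
    )
    return 1.0 / (1.0 + math.exp(-diff))


# Hypothetical participant who only cares about the first of seven metrics.
weights = [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
model_a = [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
model_b = [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]

p_a = choice_probability(weights, model_a, model_b)
p_b = choice_probability(weights, model_b, model_a)
```

Under these assumptions, a participant's weights could be estimated by maximizing the likelihood of their twenty observed choices, and the resulting weight vectors would be the inputs to clustering.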

Paper Structure

This paper contains 25 sections, 6 equations, 2 figures, 1 table.

Figures (2)

  • Figure 1: Discrete choice task. Each task shows information on ten job applicants: race (Black or White), expertise (high or low), and the actual matching result (hired = true, not hired = false), together with the prediction results of two hypothetical ML models. To reduce cognitive overload, only two rows had different sets, and these were highlighted with a yellow background. Both rows and values were randomized in each task.
  • Figure 2: Result of k-means clustering. Each color represents the average preference value of participants assigned to each cluster. The vertical axis represents preference values that have been normalized to a range of $-1$ to $1$. Error bars indicate standard error.