A Dynamic Model of Performative Human-ML Collaboration: Theory and Empirical Evidence

Tom Sühr, Samira Samadi, Chiara Farronato

TL;DR

This paper develops a dynamic framework for human-ML collaboration in which ML predictions alter human decisions and future model updates, formalized via a collaborative characteristic function $\Delta_{\mathbb{U}}$ that links model utility to the utility of human-assisted decisions. It proves, under general conditions, that the system can converge to stable points that differ in utility depending on whether human feedback improves or harms the collaboration, and it tests these ideas with a large knapsack study. The empirical results show that humans can surpass ML guidance at many ML-performance levels, incentives have little effect, and a stable equilibrium near the knapsack optimum is attainable under realistic conditions; a simple “best of two” strategy could further improve outcomes. The work highlights the importance of accounting for performative feedback and human factors when deploying interactive ML systems, with implications for robust and reliable decision-support in domains with imperfect ground truth.

Abstract

Machine learning (ML) models are increasingly used in various applications, from recommendation systems in e-commerce to diagnosis prediction in healthcare. In this paper, we present a novel dynamic framework for thinking about the deployment of ML models in a performative, human-ML collaborative system. In our framework, the introduction of ML recommendations changes the data-generating process of human decisions, which are only a proxy to the ground truth and which are then used to train future versions of the model. We show that this dynamic process in principle can converge to different stable points, i.e. where the ML model and the Human+ML system have the same performance. Some of these stable points are suboptimal with respect to the actual ground truth. As a proof of concept, we conduct an empirical user study with 1,408 participants. In the study, humans solve instances of the knapsack problem with the help of machine learning predictions of varying performance. This is an ideal setting because we can identify the actual ground truth, and evaluate the performance of human decisions supported by ML recommendations. We find that for many levels of ML performance, humans can improve upon the ML predictions. We also find that the improvement could be even higher if humans rationally followed the ML recommendations. Finally, we test whether monetary incentives can increase the quality of human decisions, but we fail to find any positive effect. Using our empirical data to approximate our collaborative system suggests that the learning process would dynamically reach an equilibrium performance that is around 92% of the maximum knapsack value. Our results have practical implications for the deployment of ML models in contexts where human decisions may deviate from the indisputable ground truth.
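The knapsack setting is convenient because the ground truth can be computed exactly. The following is a minimal sketch of how a submitted solution might be scored against the optimum, and how a "best of two" choice between the human solution and the ML recommendation could be evaluated. The function names, the scoring of infeasible solutions as 0, the definition of performance as value divided by optimal value, and the toy instance are all assumptions for illustration, not the study's implementation.

```python
from typing import Sequence


def knapsack_optimum(values: Sequence[int], weights: Sequence[int], capacity: int) -> int:
    """Maximum total value of a 0/1 knapsack (dynamic programming over capacity)."""
    best = [0] * (capacity + 1)  # best[c] = max value achievable with capacity c
    for v, w in zip(values, weights):
        # Iterate capacity downwards so each item is used at most once.
        for c in range(capacity, w - 1, -1):
            best[c] = max(best[c], best[c - w] + v)
    return best[capacity]


def solution_value(chosen: Sequence[int], values: Sequence[int],
                   weights: Sequence[int], capacity: int) -> int:
    """Total value of the chosen item indices; 0 if over capacity (assumption)."""
    if sum(weights[i] for i in chosen) > capacity:
        return 0
    return sum(values[i] for i in chosen)


def economic_performance(chosen: Sequence[int], values: Sequence[int],
                         weights: Sequence[int], capacity: int) -> float:
    """Solution value as a fraction of the optimal value (assumed definition).

    Assumes the instance has a strictly positive optimum.
    """
    return solution_value(chosen, values, weights, capacity) / knapsack_optimum(
        values, weights, capacity
    )


# Hypothetical toy instance (not taken from the study).
values = [60, 100, 120]
weights = [10, 20, 30]
capacity = 50

ml_recommendation = [0, 1]  # hypothetical ML suggestion
human_submission = [0, 2]   # hypothetical human answer after seeing the suggestion

ml_perf = economic_performance(ml_recommendation, values, weights, capacity)
human_perf = economic_performance(human_submission, values, weights, capacity)
best_of_two = max(ml_perf, human_perf)  # keep whichever solution scores higher
```

In this toy instance the human submission improves on the recommendation, so the "best of two" score equals the human score; when a human instead degrades a good recommendation, the same rule falls back to the ML solution.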

Paper Structure

This paper contains 29 sections, 2 theorems, 5 equations, 22 figures, 3 tables, 1 algorithm.

Key Result

Proposition 1

(Collaborative Improvement) If $\Delta_{\mathbb{U}}(\mathbb{U}(X,Y_M))\geq\mathbb{U}(X,Y_M)$ for all $M\in \mathcal{M}$ and $X\in\mathcal{X}$, then $\mathbb{L}_{\Delta_{\mathbb{U}}}(s,t)$ is non-decreasing in $t=1,\dots,T$, and for sufficiently large $T$ there exists a $t'\in [1,T]$ such that ...
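One way to read the monotonicity part of this statement is as a fixed-point iteration. The sketch below is not the paper's proof. It uses the shorthand $u_t$ (a symbol introduced here) for the model utility in round $t$, and it assumes, as in the Figure 1 caption, that each retrained model attains the utility of the previous round's human+ML labels:

$$
u_{t+1} = \Delta_{\mathbb{U}}(u_t) \geq u_t \quad \text{for all } t, \qquad \text{so} \qquad u_1 \leq u_2 \leq \dots \leq u_T .
$$

If utility is additionally bounded above (an assumption of this sketch), the sequence $u_t$ converges to some limit $u^*$ by monotone convergence, and if $\Delta_{\mathbb{U}}$ is continuous (also an assumption), the limit satisfies $\Delta_{\mathbb{U}}(u^*) = u^*$: at that utility, humans no longer change the performance of the model's recommendations.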

Figures (22)

  • Figure 1: Collaborative Improvement (left): The firm's collaborative characteristic function and one collaborative learning path, if humans improve on the ML solution. The x-axis denotes the model expected utility, the y-axis denotes expected human+ML utility. The firm deploys a first model with utility (s). Then humans use the model and improve utility by $\delta_1$, leading to expected human+ML utility (1). The firm learns a new model with utility (b) on the new data distribution. This is viable under the assumption that the new model has the same utility as the previous period's human+ML labels, i.e., we can move horizontally from (1) to the 45-degree line at (b). Humans can further improve utility by $\delta_2$, which leads to expected utility (2). The dynamic improvement process continues until it reaches stable point utility (6-d). Collaborative Harm (right): The firm deploys a model with expected utility (s) but the humans, when interacting with the model, decrease utility by $\delta_1$, with expected utility (1). The firm will thus learn a model of utility (b) on the new distribution. The downward spiral continues until stable point (d).
  • Figure 2: Economic Performance Across Treatments. Error bars denote 95% confidence intervals based on standard errors clustered at the user level. Solid bars denote the average economic performance of the submitted solution; striped bars denote the performance if one picked the better of the submitted solution and the provided ML recommendation. An appendix figure (label fig:performance_barplot_opt) replicates the analysis using optimality as a measure of utility.
  • Figure 3: Empirical approximations of collaborative characteristic functions for two utility functions: economic performance (left) and optimality (right). Error bars represent confidence intervals based on participant-level clustered standard errors. Significance levels for the estimates of $\delta_{qi}$ are based on t-tests against the null $\delta_{qi}=0$. ***: $p<0.001$
  • Figure 4: Optimality Across Treatments. Error bars denote 95% confidence intervals based on standard errors clustered at the user level. Solid bars denote the average optimality of the submitted solution, striped bars denote the optimality if one picked the best solution between the submitted solution and the provided ML recommendation.
  • Figure 5: Empirical Collaborative Characteristic Function for the "Optimality" utility function. Confidence intervals are based on standard errors clustered at the participant level.
  • ...and 17 more figures
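The learning paths described in the Figure 1 caption can be sketched as a simple iteration: deploy a model with utility $s$, let humans change that utility through the collaborative characteristic function, retrain to the resulting human+ML utility, and repeat until the utility stops moving. The two functions `improving` and `harming` below are hypothetical linear examples chosen only to illustrate the two regimes; they are not estimated from the study, and the stopping tolerance is likewise an assumption.

```python
from typing import Callable, List


def learning_path(delta_u: Callable[[float], float], s: float,
                  max_steps: int = 100, tol: float = 1e-6) -> List[float]:
    """Iterate u_{t+1} = delta_u(u_t) starting from model utility s.

    Assumes each retrained model attains the previous round's human+ML utility.
    Stops when consecutive utilities differ by less than tol (a stable point up
    to tolerance) or after max_steps rounds.
    """
    path = [s]
    for _ in range(max_steps):
        nxt = delta_u(path[-1])
        path.append(nxt)
        if abs(nxt - path[-2]) < tol:
            break
    return path


def improving(u: float) -> float:
    """Hypothetical: humans close half of the gap to utility 0.9 each round."""
    return u + 0.5 * (0.9 - u)


def harming(u: float) -> float:
    """Hypothetical: humans lose half of the margin above utility 0.3 each round."""
    return u - 0.5 * (u - 0.3)


up = learning_path(improving, s=0.5)   # collaborative improvement
down = learning_path(harming, s=0.5)   # collaborative harm

print(f"improvement: start {up[0]:.2f}, stable point ~{up[-1]:.4f} after {len(up) - 1} rounds")
print(f"harm:        start {down[0]:.2f}, stable point ~{down[-1]:.4f} after {len(down) - 1} rounds")
```

Both paths start from the same model utility but settle at different stable points, which is the qualitative contrast between the left and right panels of Figure 1.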

Theorems & Definitions (14)

  • Definition 1
  • Definition 2
  • Definition 3
  • Definition 4
  • Proposition 1
  • proof
  • Proposition 2
  • proof
  • Definition 5
  • Definition 6
  • ...and 4 more