Leveraging Expert Consistency to Improve Algorithmic Decision Support

Maria De-Arteaga, Vincent Jeanselme, Artur Dubrawski, Alexandra Chouldechova

TL;DR

This paper tackles the construct gap between the decision criterion of interest, $Y^c$, and the two proxies typically available as training labels in high-stakes decision support: observed outcomes $Y$ and expert decisions $D$. It proposes a two-stage strategy: (i) estimate expert consistency using influence functions when each case has only a single expert decision, and (ii) amalgamate labels so that the model learns from expert decisions in consistently assessed cases and from observed outcomes otherwise, producing an amalgamated label $Y^{\mathcal{A}}$. The methodology is evaluated on semi-synthetic simulations and a real-world child welfare dataset, where it improves predictive performance and narrows the construct gap compared to learning from $Y$ or $D$ alone. The approach relies on archival expert decisions that organizations commonly already hold, and it accounts for non-random expert assignment and potential expert bias, with implications for policy and deployment of ML decision-support systems.

Abstract

Machine learning (ML) is increasingly being used to support high-stakes decisions. However, there is frequently a construct gap: a gap between the construct of interest to the decision-making task and what is captured in proxies used as labels to train ML models. As a result, ML models may fail to capture important dimensions of decision criteria, hampering their utility for decision support. Thus, an essential step in the design of ML systems for decision support is selecting a target label among available proxies. In this work, we explore the use of historical expert decisions as a rich -- yet also imperfect -- source of information that can be combined with observed outcomes to narrow the construct gap. We argue that managers and system designers may be interested in learning from experts in instances where they exhibit consistency with each other, while learning from observed outcomes otherwise. We develop a methodology to enable this goal using information that is commonly available in organizational information systems. This involves two core steps. First, we propose an influence function-based methodology to estimate expert consistency indirectly when each case in the data is assessed by a single expert. Second, we introduce a label amalgamation approach that allows ML models to simultaneously learn from expert decisions and observed outcomes. Our empirical evaluation, using simulations in a clinical setting and real-world data from the child welfare domain, indicates that the proposed approach successfully narrows the construct gap, yielding better predictive performance than learning from either observed outcomes or expert decisions alone.
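
The label amalgamation step described above can be illustrated with a minimal sketch. It assumes the consistency estimate has already been reduced to a boolean mask per case; the function name `amalgamate_labels` and the toy arrays are hypothetical and not taken from the paper, and the influence function-based estimation of consistency is not shown.

```python
import numpy as np


def amalgamate_labels(y, d, consistent):
    """Build an amalgamated label: use the expert decision D where experts are
    deemed consistent, and the observed outcome Y everywhere else.

    `consistent` is an assumed boolean mask; how it is estimated is out of
    scope for this sketch.
    """
    y = np.asarray(y)
    d = np.asarray(d)
    consistent = np.asarray(consistent, dtype=bool)
    if not (y.shape == d.shape == consistent.shape):
        raise ValueError("y, d and consistent must have the same shape")
    # Element-wise choice between the two label sources.
    return np.where(consistent, d, y)


# Toy data (hypothetical): four cases with outcomes, expert decisions,
# and a consistency flag for each case.
y = np.array([0, 1, 0, 1])
d = np.array([1, 0, 0, 0])
consistent = np.array([True, False, False, True])

y_amalgamated = amalgamate_labels(y, d, consistent)
print(y_amalgamated)  # [1 1 0 0]
```

In this toy example, cases 1 and 4 take the expert decision and cases 2 and 3 take the observed outcome, so the resulting label differs from both $Y$ and $D$.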

Paper Structure

This paper contains 63 sections, 2 theorems, 34 equations, 11 figures, 15 tables, 8 algorithms.

Key Result

Theorem 1

Given an estimated probability ${f}_D={P}(D=1 \mid X)$ and a set $\mathcal{A}_{1}=\{\bm{x} \in \mathbf{X} : {f}_D(\bm{x})\geq \delta \}$, the confidence interval of the true probability, $P(D=1 \mid X)$, for data points in $\mathcal{A}_{1}$ can be estimated as: where $CI(P; C)= (a,b)$ denotes that $(a,b)$ is the confidence interval of P at level C. Here, $\mathbf{X}_{v}\in \mathbb{R}^{k \times n

Figures (11)

  • Figure 1: Diagram summarizing the steps of the proposed methodology.
  • Figure 2: Pipeline of child abuse hotline investigations.
  • Figure 3: Precision for top $p\%$ highest scored screened-in cases divided by outcome. Error bars show the standard deviation over 10 runs of $75-25\%$ Monte Carlo cross-validation. 'Overall prev.' is the prevalence of each outcome. Learning from $Y$ alone yields poor performance with respect to Services and Substantiated, while learning from expert decisions, $D$, alone yields poor performance with respect to Out of Home placement (OOH), even when applying a method robust to noise ($f_{noise}$). Combining both labels through $f_{weak}$, $f_{ens}$ and $f_{\mathcal{A}}$ provides the "best of both worlds". Which one is preferable depends on the specific threshold and sensitivity to the different outcomes. With respect to the priority outcome, OOH, $f_{\mathcal{A}}$ has the lowest decline in performance as thresholds increase.
  • Figure 4: Precision for top $25\%$ highest scored screened-in cases by model and outcomes. Error bars show the standard deviation over 10 runs of $75-25\%$ Monte Carlo cross-validation. Learning from $Y$ alone yields poor performance with respect to Services and Substantiated, while learning from expert decisions, $D$, alone yields poor performance with respect to OOH, even when applying a method robust to noise ($f_{noise}$). Combining both labels through different approaches, $f_{weak}$, $f_{ens}$ and $f_{\mathcal{A}}$ provides the "best of both worlds".
  • Figure 5: Mean influence per expert in non-random expert-to-patient assignment, near-deterministic bias setting, estimated across all cases in each subgroup. Expert 0 is the expert that assesses 95% of women's cases and always screens them out.
  • ...and 6 more figures
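
The metric reported in Figures 3 and 4, precision among the top $p\%$ highest scored cases, can be computed along the following lines. This is a sketch under stated assumptions: the function name `precision_at_top_p`, the rounding of the cutoff (ceiling, at least one case) and the tie-breaking (stable sort) are illustrative choices, not details confirmed by the paper.

```python
import numpy as np


def precision_at_top_p(scores, labels, p):
    """Fraction of positive labels among the top p% highest scored cases.

    Assumptions of this sketch: the cutoff is ceil(n * p / 100) with a
    minimum of one case, and ties are broken by original order.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    if scores.shape != labels.shape:
        raise ValueError("scores and labels must have the same shape")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    k = max(1, int(np.ceil(len(scores) * p / 100)))
    # Indices of the k highest scores (stable sort keeps input order on ties).
    top = np.argsort(-scores, kind="stable")[:k]
    return float(labels[top].mean())


# Toy data (hypothetical): four scored cases with binary outcomes.
scores = [0.9, 0.8, 0.4, 0.2]
labels = [1, 0, 1, 0]

precision_top_half = precision_at_top_p(scores, labels, 50)
print(precision_top_half)  # 0.5
```

Evaluating this function separately for each outcome (for example OOH, Services and Substantiated) and each trained model would give one precision value per model and outcome, which is the kind of comparison the figures summarize.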

Theorems & Definitions (2)

  • Theorem 1
  • Theorem 2