Confidence-weighted integration of human and machine judgments for superior decision-making

Felipe Yáñez, Xiaoliang Luo, Omar Valerio Minero, Bradley C. Love

TL;DR

The paper asks whether humans can still contribute meaningfully to decisions when machines, including LLMs, outperform them. It proposes a confidence-weighted logistic regression framework that integrates judgments from any number of teammates, simplifying and extending a prior Bayesian approach into a fast, interpretable method. On two benchmarks, noisy ImageNet16H object recognition and BrainBench neuroscience forecasting, the authors show that well-calibrated confidence and diversity among teammates yield complementarity: human–machine teams outperform either party alone. The approach generalizes to arbitrary sets of agents and offers a practical route to human–machine collaboration in perceptual and knowledge-intensive tasks. It is evaluated with leave-one-out cross-validation (LOOCV), and the data and code are accessible.

Abstract

Large language models (LLMs) can surpass humans in certain forecasting tasks. What role does this leave for humans in the overall decision process? One possibility is that humans, despite performing worse than LLMs, can still add value when teamed with them. A human and machine team can surpass each individual teammate when team members' confidence is well-calibrated and team members diverge in which tasks they find difficult (i.e., calibration and diversity are needed). We simplified and extended a Bayesian approach to combining judgments using a logistic regression framework that integrates confidence-weighted judgments for any number of team members. Using this straightforward method, we demonstrated its effectiveness in both image classification and neuroscience forecasting tasks. Combining human judgments with one or more machines consistently improved overall team performance. Our hope is that this simple and effective strategy for integrating the judgments of humans and machines will lead to productive collaborations.
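To make the idea of confidence-weighted integration concrete, here is a minimal sketch under stated assumptions. Each teammate's judgment on a two-alternative item is encoded as a signed confidence (the sign gives the choice, the magnitude the confidence), and a logistic regression over these features yields the team's choice. The synthetic Gaussian teammates, the plain gradient-descent fit, the simple train/test split (used here in place of LOOCV), and every function name are illustrative assumptions, not the authors' implementation.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def signed_confidence(choice, confidence, use_confidence=True):
    """Encode one teammate's judgment as a signed feature.

    choice is 0 or 1. The sign marks the chosen option; the magnitude is the
    confidence, or 1 when confidence is ablated (hypothetical flag).
    """
    magnitude = confidence if use_confidence else 1.0
    return magnitude if choice == 1 else -magnitude


def fit_logistic(X, y, lr=0.5, epochs=300):
    """Fit weights and bias by batch gradient descent on the log loss."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi))) - yi
            grad_b += err
            for j in range(d):
                grad_w[j] += err * xi[j]
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def team_choice(w, b, x):
    """Team picks option 1 when the weighted evidence is non-negative."""
    return 1 if b + sum(wj * xj for wj, xj in zip(w, x)) >= 0 else 0


def simulate(n, signals, seed=0):
    """Synthetic teammates (assumption): each sees the truth through
    independent Gaussian noise; a larger signal means a more accurate teammate."""
    rng = random.Random(seed)
    X, y, choices = [], [], []
    for _ in range(n):
        truth = rng.randint(0, 1)
        row, picks = [], []
        for s in signals:
            evidence = s * (2 * truth - 1) + rng.gauss(0.0, 1.0)
            pick = 1 if evidence > 0 else 0
            picks.append(pick)
            row.append(signed_confidence(pick, abs(evidence)))
        X.append(row)
        y.append(truth)
        choices.append(picks)
    return X, y, choices


def demo():
    """Train on 500 items, then compare individual and team accuracy on 500 others."""
    X, y, choices = simulate(1000, signals=[0.6, 1.0])  # [weaker "human", stronger "machine"]
    train, test = slice(0, 500), slice(500, 1000)
    w, b = fit_logistic(X[train], y[train])
    y_test = y[test]

    def accuracy(preds):
        return sum(p == t for p, t in zip(preds, y_test)) / len(y_test)

    human = accuracy([c[0] for c in choices[test]])
    machine = accuracy([c[1] for c in choices[test]])
    team = accuracy([team_choice(w, b, x) for x in X[test]])
    return w, human, machine, team


if __name__ == "__main__":
    weights, human_acc, machine_acc, team_acc = demo()
    print("weights:", weights)
    print("human:", human_acc, "machine:", machine_acc, "team:", team_acc)
```

Because the two simulated teammates err on different items and their confidence tracks their accuracy, the fitted model can lean on whichever teammate is more confident on a given item. This mirrors the calibration and diversity conditions described in the abstract.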

Paper Structure (4 sections, 8 equations, 14 figures, 1 algorithm)

Figures (14)

  • Figure 1: Performance of the confidence-weighted logistic combination model in the noisy object recognition task (Steyvers et al., 2022). Accuracy results at a high level of image noise ($\Omega=125$) with the logistic combination model. Human–machine teams (green points) consistently outperform teams without humans (blue points). Each data point corresponds to the average across 7,239 image evaluations. Error bars represent standard error of the mean using a binomial model.
  • Figure 2: Assessing humans and LLMs using BrainBench (Luo et al., 2024). (A) The benchmark comprises test cases constructed from Journal of Neuroscience abstracts. Abstracts consist of background, methods, and results. The test-taker chose which of two versions of the abstract was the original. The altered version remained coherent while significantly changing the results. The 100 test cases considered here were constructed by GPT-4 with human oversight and quality control. (B) An example test case. Humans were instructed to select which version of the abstract was the original by clicking on either blue or green text to select that set of options. Test cases varied in the number of alternatives, but a single click chose all options of the same color. After making their choice, humans indicated their confidence. LLMs chose the version of the abstract with the lower perplexity score, and their confidence was assessed by the absolute difference in perplexity between the two options.
  • Figure 3: Conditions for effective collaboration between human experts and LLMs were satisfied. (A) When human experts and LLMs were confident in their BrainBench judgments, they were more likely to be correct. Confidence ratings were sorted into equal bins, and the mean accuracy for each bin was plotted. The positive slope of the black regression lines for humans and Llama2 chat models (7B, 13B, and 70B) indicates well-calibrated confidence (Luo et al., 2024; Keren, 1991; Baranski & Petrusic, 1994; Tian et al., 2023), meaning higher confidence correlates with higher accuracy. (B) Item difficulty Spearman correlations among LLMs and human experts. For LLMs, the difference in perplexity between incorrect and correct abstracts was used to determine the relative difficulty of test cases. Mean accuracy was used for human experts. LLMs align more with each other than with humans, which implies human–machine teams will be diverse. Heatmap color scale ranges from 0.1 to 0.9. (C) LLMs surpass human experts on BrainBench overall. Error bars represent standard error of the mean using a binomial model.
  • Figure 4: Performance of all possible teams using the confidence-weighted logistic combination model. Adding a human to a team with one or more machines (blue points) always has a benefit (green points). Llama2 chat 7B, 13B, and 70B models are considered. Each data point corresponds to the average across 503 test case evaluations. Error bars represent standard error of the mean using a binomial model.
  • Figure 5: Removing confidence from the logistic combination model diminishes team performance. Accuracy results on the neuroscience forecasting task with the confidence-weighted logistic combination model, where the magnitude of the confidence scores was set to 1, i.e., $f(x)=1$ in Equation (\ref{eq:squashing}). Adding a human to a team with one or more machines (blue points) does not necessarily improve performance (green points). Llama2 chat 7B, 13B, and 70B models are considered. Each data point corresponds to the average across 503 test case evaluations. Error bars represent standard error of the mean using a binomial model.
  • ...and 9 more figures
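
Several quantities named in the captions above are simple to compute, and the sketch below illustrates three of them: an LLM's choice and confidence derived from two perplexities (Figure 2), the binomial standard error used for error bars, and mean accuracy per confidence bin (Figure 3A). The function names are hypothetical, the tie-breaking rule is an assumption, and "equal bins" is read here as equal-count bins, which the captions do not confirm.

```python
import math


def llm_judgment(perplexity_a, perplexity_b):
    """Return (choice, confidence) for a two-alternative test case.

    choice is 0 for version A and 1 for version B, whichever has the lower
    perplexity. Confidence is the absolute perplexity difference.
    Ties default to version A with zero confidence (an assumption).
    """
    choice = 0 if perplexity_a <= perplexity_b else 1
    return choice, abs(perplexity_a - perplexity_b)


def binomial_sem(accuracy, n):
    """Standard error of a proportion under a binomial model: sqrt(p(1-p)/n)."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n)


def binned_accuracy(confidences, correct, n_bins):
    """Sort trials by confidence, split them into near-equal-count bins
    (an assumed reading of "equal bins"), and return mean accuracy per bin."""
    order = sorted(range(len(confidences)), key=lambda i: confidences[i])
    means = []
    for k in range(n_bins):
        lo = k * len(order) // n_bins
        hi = (k + 1) * len(order) // n_bins
        members = order[lo:hi]
        if members:  # skip empty bins when there are fewer trials than bins
            means.append(sum(correct[i] for i in members) / len(members))
    return means
```

A per-bin accuracy list that rises from low-confidence to high-confidence bins corresponds to the positive regression slope that Figure 3A reads as well-calibrated confidence.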