Human and AI Perceptual Differences in Image Classification Errors

Minghao Liu, Jiaheng Wei, Yang Liu, James Davis

TL;DR

The paper investigates perceptual differences between human and machine classifiers in image classification, looking beyond overall accuracy. It analyzes confusion matrices and partitions tasks by difficulty using machine confidence, machine agreement, and human effort, supplemented by hypothesis testing and collaboration experiments. The findings show that machines tend to make similar mistakes across models, while humans exhibit different error patterns, and that human-machine collaboration can achieve higher accuracy than either alone, both in an ideal oracle setting and in a realistic threshold-based setting. The work highlights practical implications for designing hybrid AI systems, cautions against assuming that machine perception resembles human perception, and points to potential benefits for high-stakes domains such as medical imaging and for cognitive modeling.

Abstract

Artificial intelligence (AI) models for computer vision trained with supervised machine learning are assumed to solve classification tasks by imitating human behavior learned from training labels. Most efforts in recent vision research focus on measuring the model task performance using standardized benchmarks such as accuracy. However, limited work has sought to understand the perceptual difference between humans and machines. To fill this gap, this study first analyzes the statistical distributions of mistakes from the two sources and then explores how task difficulty level affects these distributions. We find that even when AI learns an excellent model from the training data, one that outperforms humans in overall accuracy, these AI models have significant and consistent differences from human perception. We demonstrate the importance of studying these differences with a simple human-AI teaming algorithm that outperforms humans alone, AI alone, or AI-AI teaming.
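
The teaming idea mentioned above can be illustrated with a minimal sketch. This is not the authors' algorithm, which this summary does not specify. It assumes that "oracle" teaming counts an image as correct whenever either teammate is correct, and that "threshold-based" teaming keeps the machine's answer when its confidence reaches a cutoff and otherwise defers to the human. The function names, the cutoff `tau`, and the toy data are all hypothetical.

```python
from typing import List, Sequence


def accuracy(preds: Sequence[int], labels: Sequence[int]) -> float:
    """Fraction of predictions that equal the ground-truth label."""
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


def oracle_team(machine: Sequence[int], human: Sequence[int],
                labels: Sequence[int]) -> List[int]:
    """Assumed upper bound: the team is right whenever either member is right.

    This needs the true labels, so it is only an analysis tool,
    not something that could be deployed.
    """
    return [y if (m == y or h == y) else m
            for m, h, y in zip(machine, human, labels)]


def threshold_team(machine: Sequence[int], confidence: Sequence[float],
                   human: Sequence[int], tau: float) -> List[int]:
    """Assumed realistic rule: trust the machine when its confidence is at
    least tau, otherwise defer to the human answer."""
    return [m if c >= tau else h
            for m, c, h in zip(machine, confidence, human)]


# Toy data, made up purely for illustration (not results from the paper).
labels = [0, 1, 2, 3, 4, 5]
machine = [0, 1, 2, 3, 9, 8]          # wrong on the last two images
confidence = [0.99, 0.95, 0.90, 0.97, 0.40, 0.55]
human = [0, 1, 7, 3, 4, 8]            # wrong on the third and last images

print("machine alone :", accuracy(machine, labels))
print("human alone   :", accuracy(human, labels))
print("oracle team   :", accuracy(oracle_team(machine, human, labels), labels))
print("threshold team:", accuracy(threshold_team(machine, confidence, human, 0.6), labels))
```

In this toy setup the team helps only because the two members fail on different images, which is the kind of complementarity the abstract describes. Two classifiers that make the same mistakes would gain nothing from either rule.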

Paper Structure

This paper contains 26 sections, 9 equations, 7 figures, 4 tables.

Figures (7)

  • Figure 1: Stimuli for current study: Representative images from the CIFAR-10 dataset, which includes ten categories of natural images. The perceptual difference between human and AI classifiers is studied using the distribution of mistakes made while predicting categorical labels.
  • Figure 2: Confusion matrices of incorrect answers: The figure shows confusion matrices across permutations of machine classifiers or human annotators. The plot focuses on incorrect predictions from the test subjects to see if they make similar mistakes. For example, a darker cell in "Humans vs Machines" means a higher probability that the three human annotators make the same mistake as the thirteen designs of machine classifiers. (a) A mild diagonal line indicates that humans don't always make the same mistakes. (b) The strong diagonal line indicates that all the machine models tend to make similar mistakes. (c) The diagonal line is weak, indicating that the mistakes made by humans and machines diverge in this comparison.
  • Figure 3: Accuracy as a function of difficulty level: The plots visualize the performance of humans and machines on tasks ranked by difficulty level. The shaded band indicates the range of accuracies for all classifiers, and the solid line represents the average performance. Task difficulty is measured by (a) machine classifier confidence, (b) machine agreement, (c) human annotation time, and (d) human agreement. Plots (a,b) show that machine classifier performance correlates strongly with machine-derived difficulty levels, while human performance is significantly less correlated. Plots (c,d) indicate that both human and machine performance correlate with human-derived difficulty levels.
  • Figure 4: Matching Percentage on balanced set: The figure visualizes the matching percentage between each machine classifier and every other human or machine classifier on a balanced set. The machine classifiers are not all trained with the same training examples, yet the results show that machines tend to make judgments that match other machines more often than they match humans.
  • Figure 5: Post-hoc teaming: The figure shows the original model performance and the boost from each teaming option. "Add human" is teaming with a human classifier, "Add aggre" is teaming with a human classifier that aggregates answers from three humans, and "Add model" is teaming with another machine classifier. We compared all the permutations and visualized the best teaming combinations using a color map. A darker color indicates a greater boost. (a) Oracle mode is the upper bound from perfect teaming; (b) realistic mode comes from a simple, realizable algorithm. The results show the value of complementary human-machine teaming. Adding a lower-performing human to the team yields a larger boost than adding a higher-performing machine classifier.
  • ...and 2 more figures
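
The comparisons described in Figures 2 and 4 can be sketched roughly as follows. The exact constructions used in the paper are not given in this summary, so the code makes two assumptions: the matching percentage is the share of images on which two classifiers output the same label, and the confusion matrix of incorrect answers tabulates the pair of wrong labels on images that both classifiers misclassify. All function names and the toy predictions are hypothetical.

```python
from collections import Counter
from typing import List, Sequence


def matching_percentage(preds_a: Sequence[int], preds_b: Sequence[int]) -> float:
    """Percentage of images on which two classifiers output the same label."""
    same = sum(a == b for a, b in zip(preds_a, preds_b))
    return 100.0 * same / len(preds_a)


def shared_error_matrix(preds_a: Sequence[int], preds_b: Sequence[int],
                        labels: Sequence[int], num_classes: int) -> List[List[int]]:
    """Count (wrong label from A, wrong label from B) pairs over the images
    that both classifiers get wrong. Rows index A, columns index B."""
    counts = Counter((a, b) for a, b, y in zip(preds_a, preds_b, labels)
                     if a != y and b != y)
    return [[counts[(i, j)] for j in range(num_classes)]
            for i in range(num_classes)]


def diagonal_mass(matrix: List[List[int]]) -> float:
    """Share of joint errors in which both classifiers chose the same wrong
    label. A value near 1 corresponds to a strong diagonal."""
    total = sum(sum(row) for row in matrix)
    if total == 0:
        return 0.0
    return sum(matrix[i][i] for i in range(len(matrix))) / total


# Toy predictions over three classes, made up for illustration only.
labels = [0, 0, 1, 1, 2, 2]
model_a = [1, 0, 2, 1, 0, 2]
model_b = [1, 0, 2, 1, 0, 1]
human = [2, 0, 0, 1, 2, 2]

print("A vs B match %    :", matching_percentage(model_a, model_b))
print("A vs human match %:", matching_percentage(model_a, human))
print("A vs B diagonal   :", diagonal_mass(shared_error_matrix(model_a, model_b, labels, 3)))
print("A vs human diag.  :", diagonal_mass(shared_error_matrix(model_a, human, labels, 3)))
```

The toy data is built so that the two models share their wrong answers while the human errs differently, mirroring the qualitative pattern the captions report. It says nothing about the magnitudes found in the paper.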