A comparison between humans and AI at recognizing objects in unusual poses

Netta Ollikka, Amro Abbas, Andrea Perin, Markku Kilpeläinen, Stéphane Deny

TL;DR

This study directly compares humans and AI on recognizing objects in unusual, non-canonical poses, using a controlled dataset with parallel psychophysics and machine evaluations. It shows that humans are highly robust under unlimited viewing time, while most pure-vision networks and many vision-language models are brittle to pose changes, with Gemini 1.5 showing notable robustness. Time-limited viewing (e.g., 40 ms) dramatically reduces human performance, bringing it close to network performance and implying that extra processing (potentially recurrent and retrospective) drives human resilience. The work highlights distinct error patterns between humans and AI, suggests recurrence and evidence accumulation as possible mechanisms behind human robustness, and motivates incorporating time-dependent and cross-system processing into AI models to close the gap in object recognition under challenging poses.

Abstract

Deep learning is closing the gap with human vision on several object recognition benchmarks. Here we investigate this gap for challenging images where objects are seen in unusual poses. We find that humans excel at recognizing objects in such poses. In contrast, state-of-the-art deep networks for vision (EfficientNet, SWAG, ViT, SWIN, BEiT, ConvNext) and state-of-the-art large vision-language models (Claude 3.5, Gemini 1.5, GPT-4) are systematically brittle on unusual poses, with the exception of Gemini, which shows excellent robustness in that condition. As we limit image exposure time, human performance degrades to the level of deep networks, suggesting that additional mental processes (requiring additional time) are necessary to identify objects in unusual poses. An analysis of error patterns of humans vs. networks reveals that even time-limited humans are dissimilar to feed-forward deep networks. In conclusion, our comparison reveals that humans and deep networks rely on different mechanisms for recognizing objects in unusual poses. Understanding the nature of the mental processes taking place during extra viewing time may be key to reproducing the robustness of human vision in silico.

Paper Structure (38 sections, 3 equations, 8 figures)

Figures (8)

  • Figure 1: Example images of the dataset, and corresponding answer choices. Four examples of objects and their three different rotations: (left) upright, (middle) rotated and correctly labeled by EfficientNet, and (right) rotated and incorrectly labeled by EfficientNet. Above each image is shown the correct label and below each image is the alternative label that we selected based on EfficientNet's predictions (see the Dataset collection section for details of this selection).
  • Figure 2: Description of the human tests used in this study. a) Task with limited viewing time: First the subject fixates on a cross, then an image is displayed for either 40 ms or 150 ms, followed by a dynamic checkerboard mask shown for 500 ms. Then the subject is asked to choose between two labels for the image (i.e., a two-alternative forced-choice task), and has unlimited time to answer. b) Task with unlimited viewing time: similar test setting, but now the image and the answer choices are displayed together for an unlimited viewing time and without backward masking.
  • Figure 3: Comparing neural networks and humans at recognizing objects in various poses. Dark grey bars: Average performance of pure vision deep networks (left: upright vs. right: rotated) with 95% confidence intervals (n=5, grey points represent individual network performances). Light grey bars: Average performance of seven large vision-language models (VLMs). Diamonds indicate individual model performances (full diamonds show the best-performing version of the different model classes). Blue bars: Average human performance with unlimited viewing time (n=12, grey points represent individual performances). Orange and red bars: Average human performance with limited viewing time (150 ms and 40 ms, respectively, n=12, grey points represent individual performances). Chance performance is 50%. Three stars (***) indicate highly significant differences (p < 0.001); "n.s." indicates a non-significant difference. With unlimited time, humans excel at recognizing rotated objects, while pure vision networks struggle (best: SWAG at 70.1%). GPT-4, Claude and SigLIP models follow the same pattern as the pure vision networks, with a significant drop in accuracy for rotated images compared to upright ones. However, Gemini 1.5 Flash mirrors human performance with unlimited viewing time, achieving 97.7% accuracy on rotated images compared to human accuracy of 98.9%. Limiting human viewing time (40 ms or 150 ms) impairs their ability to recognize rotated objects, substantially more than upright objects, bringing their performance closer to network levels.
  • Figure 4: Error patterns are different for neural networks and time-limited humans. a) Comparison of human and network accuracy for the rotated-correct condition (rotations that EfficientNet labeled correctly) vs. the rotated-incorrect condition (rotations on which EfficientNet failed). Humans show consistent accuracy between rotated-correct and rotated-incorrect conditions. In contrast, networks, including the seven VLMs (represented by diamonds), exhibit a performance drop. However, for Gemini 1.5 Flash, Gemini 1.5 Pro and Claude Opus, the decrease in performance between the two conditions is less pronounced. b) Error consistency analysis performed on the 5 neural networks (not including VLMs) and 40 ms time-limited human subjects. The 12 human subjects were partitioned into 4 groups of 3, so that every group saw each rotated image exactly once. The darker red cluster for networks (dark red = highly consistent errors) indicates that they have similar patterns of error, which are not shared by human subjects, highlighting that EfficientNet errors transfer better to other networks than to humans. c) Mean error consistencies were calculated by comparing networks with each other and with human subjects (i.e., average computed over the different matrix clusters). A two-tailed unpaired t-test confirms that networks indeed make more consistent errors with each other than with humans (t(28) = 4.0, p = 3.8e-04).
  • Figure 5: Examples of objects in unusual poses, where deep networks for pure vision and 40 ms time-limited humans made similar and differing errors. Above each image is a quantitative score of correct answers and below each image are the given answer choices, with the correct answer listed first. a) Images where all networks and humans correctly labeled the object. b) Images where all networks failed to correctly label the objects, but where humans mostly chose the correct answer. c) Images where all the networks correctly labeled the objects, but where most humans failed. d) Images where both networks and humans were mostly unable to correctly label the object.
  • ...and 3 more figures
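
The error consistency analysis and t-test described in the Figure 4 caption can be illustrated with a short sketch. The page does not give the formula, so this assumes the measure is Cohen's kappa computed on per-image correctness, which is a common definition of error consistency, and that the unpaired t-test uses a pooled variance (Student's t). The function names `error_consistency` and `unpaired_t` and the toy data are hypothetical, not taken from the paper. As a side note, the reported 28 degrees of freedom would be consistent with 10 network-network pairs and 20 network-human-group pairs (10 + 20 - 2 = 28), though that reading is an inference, not something stated in the caption.

```python
from math import sqrt
from statistics import mean, variance


def error_consistency(correct_a, correct_b):
    """Cohen's kappa between two observers' per-image correctness (booleans).

    Assumed definition: kappa = (c_obs - c_exp) / (1 - c_exp), where c_obs is
    the fraction of images on which both observers are right or both are wrong,
    and c_exp is the agreement expected from their accuracies alone.
    """
    if len(correct_a) != len(correct_b) or len(correct_a) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(correct_a)
    acc_a = sum(correct_a) / n
    acc_b = sum(correct_b) / n
    observed = sum(a == b for a, b in zip(correct_a, correct_b)) / n
    expected = acc_a * acc_b + (1 - acc_a) * (1 - acc_b)
    if expected == 1.0:
        # Kappa is undefined here; returning 0.0 is a convention chosen for this sketch.
        return 0.0
    return (observed - expected) / (1 - expected)


def unpaired_t(sample_x, sample_y):
    """Student's t statistic with pooled variance, plus its degrees of freedom."""
    nx, ny = len(sample_x), len(sample_y)
    df = nx + ny - 2
    pooled = ((nx - 1) * variance(sample_x) + (ny - 1) * variance(sample_y)) / df
    t = (mean(sample_x) - mean(sample_y)) / sqrt(pooled * (1 / nx + 1 / ny))
    return t, df


# Toy data (illustrative only): per-image correctness of two observers.
net_a = [True, True, False, False]
net_b = [True, True, False, False]
human = [True, False, True, False]
print(error_consistency(net_a, net_b))  # identical error patterns
print(error_consistency(net_a, human))  # agreement at the level expected by chance
```

A kappa of 1 means two observers err on exactly the same images, 0 means their errors overlap no more than their accuracies alone predict, and negative values mean they systematically err on different images. In an analysis like the one in the caption, the pairwise kappas for network-network pairs and for network-human pairs would form the two samples passed to `unpaired_t`.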