A comparison between humans and AI at recognizing objects in unusual poses
Netta Ollikka, Amro Abbas, Andrea Perin, Markku Kilpeläinen, Stéphane Deny
TL;DR
This study directly compares humans and AI on recognizing objects in unusual, non-canonical poses, using a controlled dataset with parallel psychophysics experiments and machine evaluations. It shows that humans are highly robust when viewing time is unlimited, whereas most pure-vision networks and many vision-language models are brittle to pose changes, with Gemini 1.5 as a notable exception. Limiting viewing time (e.g., to 40 ms) sharply reduces human performance, bringing it down to the level of the networks and suggesting that additional processing (potentially recurrent and retrospective) drives human resilience. The work highlights distinct error patterns between humans and AI, proposes recurrence and evidence accumulation as possible mechanisms behind human robustness, and motivates incorporating time-dependent and cross-system processing into AI models to close the gap in object recognition under challenging poses.
Abstract
Deep learning is closing the gap with human vision on several object recognition benchmarks. Here we investigate this gap for challenging images where objects are seen in unusual poses. We find that humans excel at recognizing objects in such poses. In contrast, state-of-the-art deep networks for vision (EfficientNet, SWAG, ViT, SWIN, BEiT, ConvNext) and state-of-the-art large vision-language models (Claude 3.5, Gemini 1.5, GPT-4) are systematically brittle on unusual poses, with the exception of Gemini, which shows excellent robustness in that condition. As we limit image exposure time, human performance degrades to the level of deep networks, suggesting that additional mental processes (requiring additional time) are necessary to identify objects in unusual poses. An analysis of error patterns of humans vs. networks reveals that even time-limited humans are dissimilar to feed-forward deep networks. In conclusion, our comparison reveals that humans and deep networks rely on different mechanisms for recognizing objects in unusual poses. Understanding the nature of the mental processes taking place during extra viewing time may be key to reproducing the robustness of human vision in silico.
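The abstract does not say which measure was used to compare error patterns. As an illustration only, one common choice is trial-level error consistency: Cohen's kappa computed on whether two observers were correct or wrong on the same images. The sketch below assumes that measure; the function name `error_consistency` and the toy inputs are hypothetical and are not taken from the paper.

```python
def error_consistency(correct_a, correct_b):
    """Cohen's kappa on per-image correctness of two observers.

    ASSUMPTION: this is one possible way to compare error patterns,
    not necessarily the analysis used in the paper.
    Inputs are equal-length sequences of truthy/falsy values, one per image.
    """
    if len(correct_a) != len(correct_b) or len(correct_a) == 0:
        raise ValueError("inputs must be non-empty and of equal length")

    n = len(correct_a)
    a = [bool(x) for x in correct_a]
    b = [bool(x) for x in correct_b]

    # Observed agreement: fraction of images where both are right or both are wrong.
    observed = sum(x == y for x, y in zip(a, b)) / n

    # Agreement expected by chance from the two accuracies alone.
    acc_a = sum(a) / n
    acc_b = sum(b) / n
    expected = acc_a * acc_b + (1 - acc_a) * (1 - acc_b)

    if expected == 1.0:
        # Both observers are always right (or always wrong): kappa is undefined.
        return 0.0
    return (observed - expected) / (1 - expected)


# Toy data (hypothetical): 1 = correct, 0 = wrong, one entry per image.
human = [1, 1, 0, 0]
network = [1, 0, 1, 0]
kappa = error_consistency(human, network)
```

A value near 1 means the two observers fail on the same images, a value near 0 means their errors overlap no more than their accuracies alone would predict, and negative values mean they tend to fail on different images. Accuracy alone cannot distinguish these cases, which is why a comparison of mechanisms needs a trial-level measure of this kind.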
