Improving generalization by mimicking the human visual diet
Spandan Madan, You Li, Mengmi Zhang, Hanspeter Pfister, Gabriel Kreiman
TL;DR
Biological vision generalizes across real-world transformations while standard computer vision models struggle. The authors argue that the training data, i.e., the visual diet, is a key determinant of generalization and that humans learn from limited, context-rich 3D scenes with diverse transformations. They introduce the Human Visual Diet (HVD) dataset and the HDNet architecture that leverages scene context via a cross-attention transformer and a contrastive loss over real-world transformations. Across synthetic-to-real and real-world evaluations, models trained with this diet outperform standard baselines and other domain-generalization approaches, demonstrating strong generalization to unseen transformations and real-world data.
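To make the phrase "contrastive loss over real-world transformations" concrete, here is a minimal sketch in plain Python. It is not the authors' implementation. It assumes a supervised contrastive form, in which images of the same category seen under different lighting, viewpoint, or material are treated as positives. The function names, the temperature value, and the toy embeddings are all hypothetical.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (assumes the vector is non-zero)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def contrastive_loss(embeddings, labels, temperature=0.1):
    """Illustrative supervised contrastive loss (an assumed formulation).

    embeddings: list of feature vectors, one per image.
    labels: list of category ids; images sharing a label count as positives,
            e.g. the same category under a different transformation.
    """
    z = [l2_normalize(e) for e in embeddings]
    n = len(z)
    total, count = 0.0, 0
    for i in range(n):
        # Temperature-scaled cosine similarity from anchor i to every other sample.
        sims = {
            j: sum(a * b for a, b in zip(z[i], z[j])) / temperature
            for j in range(n)
            if j != i
        }
        positives = [j for j in sims if labels[j] == labels[i]]
        if not positives:
            continue  # an anchor without positives contributes nothing
        # Numerically stable log-sum-exp over all non-anchor samples.
        m = max(sims.values())
        log_denom = m + math.log(sum(math.exp(s - m) for s in sims.values()))
        # Mean negative log-probability of selecting a positive.
        total += -sum(sims[j] - log_denom for j in positives) / len(positives)
        count += 1
    return total / count if count else 0.0
```

With toy 2D embeddings, the loss is small when same-category samples lie close together and large when they are far apart. This is the property that pushes a model toward features that stay stable under transformation.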
Abstract
We present a new perspective on bridging the generalization gap between biological and computer vision: mimicking the human visual diet. While computer vision models rely on internet-scraped datasets, humans learn from limited 3D scenes under diverse real-world transformations with objects in natural context. Our results demonstrate that incorporating variations and contextual cues ubiquitous in the human visual training data (visual diet) significantly improves generalization to real-world transformations such as lighting, viewpoint, and material changes. This improvement also extends to generalizing from synthetic to real-world data: all models trained with a human-like visual diet outperform specialized architectures by large margins when tested on natural image data. These experiments are enabled by our two key contributions: a novel dataset capturing scene context and diverse real-world transformations to mimic the human visual diet, and a transformer model tailored to leverage these aspects of the human visual diet. All data and source code can be accessed at https://github.com/Spandan-Madan/human_visual_diet.
