Improving generalization by mimicking the human visual diet

Spandan Madan, You Li, Mengmi Zhang, Hanspeter Pfister, Gabriel Kreiman

TL;DR

Biological vision generalizes across real-world transformations while standard computer vision models struggle. The authors argue that the training data, i.e., the visual diet, is a key determinant of generalization and that humans learn from limited, context-rich 3D scenes with diverse transformations. They introduce the Human Visual Diet (HVD) dataset and the HDNet architecture that leverages scene context via a cross-attention transformer and a contrastive loss over real-world transformations. Across synthetic-to-real and real-world evaluations, models trained with this diet outperform standard baselines and other domain-generalization approaches, demonstrating strong generalization to unseen transformations and real-world data.

Abstract

We present a new perspective on bridging the generalization gap between biological and computer vision -- mimicking the human visual diet. While computer vision models rely on internet-scraped datasets, humans learn from limited 3D scenes under diverse real-world transformations with objects in natural context. Our results demonstrate that incorporating variations and contextual cues ubiquitous in the human visual training data (visual diet) significantly improves generalization to real-world transformations such as lighting, viewpoint, and material changes. This improvement also extends to generalizing from synthetic to real-world data -- all models trained with a human-like visual diet outperform specialized architectures by large margins when tested on natural image data. These experiments are enabled by our two key contributions: a novel dataset capturing scene context and diverse real-world transformations to mimic the human visual diet, and a transformer model tailored to leverage these aspects of the human visual diet. All data and source code can be accessed at https://github.com/Spandan-Madan/human_visual_diet.

Paper Structure

This paper contains 29 sections, 2 equations, 10 figures, 6 tables.

Figures (10)

  • Figure 1: Mimicking and exploiting the human visual diet. (a) Comparing human and machine visual diets: The desk in the 3D room is viewed under a variety of real-world transformations, which are essential components of the human visual diet. Furthermore, objects are always seen in the context of their surroundings. In contrast, sample images of internet-scraped desks, which constitute the machine visual diet, contain neither these real-world transformations nor scene context. (b) Mimicking the human visual diet by introducing disentangled lighting, material, and viewpoint changes to a 3D scene where objects are shown in context. (c) Exploiting the human visual diet by using a two-stream architecture that reasons over both the target object and its surrounding scene context.
  • Figure 2: Datasets with real-world transformations. (a) Sample images from the Human Visual Diet dataset: We created 15 photo-realistic domains with three disentangled real-world transformations: lighting, material, and viewpoint changes. Each 3D scene was created by reconstructing an existing ScanNet \cite{dai2017scannet} scene using the OpenRooms framework \cite{li2020openrooms}, followed by the introduction of controlled changes in scene parameters before rendering these images. (b) Sample images from the Semantic-iLab dataset: We modify the existing iLab dataset \cite{borji2016ilab} by augmenting images with changes in lighting and material. These changes are achieved by modifying the white balance and by using AdaIN-based style transfer \cite{huang2017adain}, respectively.
  • Figure 3: Human Visual Diet leads to significantly improved generalization across real-world transformations. (a) Existing models struggle to generalize across real-world transformations, especially material and viewpoint changes. This result holds for both the HVD and Semantic-iLab datasets. (b) Increasing real-world transformational diversity leads to a significant increase in generalization performance for all transformations (lighting, viewpoint, and material) for both datasets. (c) HDNet leverages scene context, resulting in substantially better generalization than seminal domain generalization architectures such as ERM \cite{blanchard2017domainmtl} and IRM \cite{arjovsky2019invariant}. HDNet is designed to incorporate scene context into visual recognition by using a two-stream architecture to reason over the target object and scene context simultaneously. In contrast, the above-mentioned state-of-the-art approaches for domain generalization are single-stream architectures that do not leverage scene context. HDNet also beats a suite of additional domain generalization baselines presented in Table \ref{table:dg_benchmarks}. The closest-performing baseline is another context-aware \cite{kim2021selfreg} model (CRTNet \cite{bomatter2021pigs}), and our proposed model beats these baselines for all three transformations with statistical significance. For all plots, statistical significance is evaluated using a two-sample t-test, and an $^*$ indicates a p-value lower than the threshold of $0.05$. See Methods for additional details.
  • Figure 4: Data post-processing does not match gains from collecting data mimicking the human visual diet. (a) Models trained with 80% real-world transformational diversity (RWTD) significantly outperform models trained with 20% RWTD along with traditional data augmentation. This is true for all transformations (lighting, material, and viewpoint) across both the HVD and Semantic-iLab datasets. The number of images is held constant in these experiments. (b) Sample images from style transfer domains created using AdaIN \cite{huang2017adain}, alongside accuracies of models trained with these domains. Models trained on style transfer domains generalize significantly worse than those trained with material diversity. (c) Generalization from one transformation to another (asymmetric diversity) does not help as much as training with the correct transformation: the best generalization to unseen materials is achieved when material diversity is added to the training data. For generalizing to unseen lighting and viewpoint changes as well, training with the corresponding real-world diversity helps the most.
  • Figure 5: Utility of the human visual diet in generalizing from synthetic to real-world, natural image data. (a) Sample synthetic images from the HVD dataset used for training the model, and the corresponding real-world natural images from ScanNet used for testing. (b) Human Visual Diet enables substantially better generalization from synthetic to natural image data. Our approach, which mimics and effectively utilizes transformational diversity and scene context, leads to better performance than all other baselines.
  • ...and 5 more figures
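
The Figure 2 caption describes the Semantic-iLab augmentations only in words: lighting is varied by modifying the white balance, and material is varied with AdaIN-based style transfer. The sketch below illustrates the two underlying operations in plain Python. It is not the authors' pipeline: the function names, the per-channel gain values, and the flat-list data layout are assumptions made for illustration, and a full AdaIN style-transfer system would apply the normalization to encoder features and decode the result back to an image, which is omitted here.

```python
from statistics import fmean, pstdev


def white_balance(pixels, gains):
    """Scale each RGB channel by a gain and clip the result to [0, 1].

    pixels: list of (r, g, b) tuples with values in [0, 1].
    gains:  hypothetical per-channel multipliers, e.g. (1.2, 1.0, 0.8)
            for a warmer colour cast (values chosen for illustration).
    """
    return [
        tuple(min(1.0, max(0.0, value * gain)) for value, gain in zip(px, gains))
        for px in pixels
    ]


def adain(content, style, eps=1e-5):
    """Adaptive instance normalization for a single feature channel.

    Shifts and rescales the content activations so that their mean and
    standard deviation match those of the style activations:
        sigma(style) * (x - mu(content)) / sigma(content) + mu(style)
    Both arguments are flat lists of floats (an assumed layout).
    """
    mu_c, sigma_c = fmean(content), pstdev(content)
    mu_s, sigma_s = fmean(style), pstdev(style)
    # eps guards against division by zero for constant content channels.
    return [sigma_s * (x - mu_c) / (sigma_c + eps) + mu_s for x in content]


if __name__ == "__main__":
    warm = white_balance([(0.5, 0.5, 0.5)], gains=(1.2, 1.0, 0.8))
    print("white-balanced pixel:", warm[0])

    stylized = adain([0.0, 1.0, 2.0, 3.0], [10.0, 12.0, 14.0, 16.0])
    print("stylized channel mean:", fmean(stylized))
```

In this sketch the white-balance change touches only global channel intensities, while AdaIN replaces the first- and second-order statistics of a feature channel with those of a style source. Both operate on an existing 2D image or its features, which is consistent with the caption's framing of them as post hoc augmentations rather than changes to the underlying 3D scene.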