Learning to See Through a Baby's Eyes: Early Visual Diets Enable Robust Visual Intelligence in Humans and Machines

Yusen Cai, Bhargava Satya Nunna, Qing Lin, Mengmi Zhang

TL;DR

This work proposes two developmentally inspired visual curricula, CATDiet (Color-Diet, Acuity-Diet, Temporality-Diet) and CombDiet, that guide self-supervised learning (SSL) to emulate infant-like visual development. CATDiet improves robustness to image corruptions and reveals neurally and behaviorally aligned signatures (e.g., depth emergence and visual cliff-like responses) without supervision from biological data. CombDiet extends this by initializing SSL with CATDiet before standard training, yielding better in-domain and out-of-domain object recognition and depth perception across multiple datasets and architectures. The 10-dataset benchmark indicates that staged, ecologically grounded visual experience fosters robust, generalizable visual intelligence in machines, with practical applicability to real-world vision tasks. All code, data, and models will be publicly released to facilitate replication and further exploration.

Abstract

Newborns perceive the world with low-acuity, color-degraded, and temporally continuous vision, which gradually sharpens as infants develop. To explore the ecological advantages of such staged "visual diets", we train self-supervised learning (SSL) models on object-centric videos under constraints that simulate infant vision: grayscale-to-color (C), blur-to-sharp (A), and preserved temporal continuity (T), collectively termed CATDiet. For evaluation, we establish a comprehensive benchmark across ten datasets, covering clean and corrupted image recognition, texture-shape cue conflict tests, silhouette recognition, depth-order classification, and the visual cliff paradigm. All CATDiet variants demonstrate enhanced robustness in object recognition, despite being trained solely on object-centric videos. Remarkably, models also exhibit biologically aligned developmental patterns, including neural plasticity changes mirroring synaptic density in macaque V1 and behaviors resembling infants' visual cliff responses. Building on these insights, CombDiet initializes SSL with CATDiet before standard training while preserving temporal continuity. Trained on object-centric or head-mounted infant videos, CombDiet outperforms standard SSL on both in-domain and out-of-domain object recognition and depth perception. Together, these results suggest that the developmental progression of early infant visual experience offers a powerful reverse-engineering framework for understanding the emergence of robust visual intelligence in machines. All code, data, and models will be publicly released.
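
The grayscale-to-color (C) and blur-to-sharp (A) constraints can be pictured as per-epoch schedules. The following is a minimal sketch under stated assumptions, not the authors' implementation: the linear ramp, the function names, and the starting blur level `sigma_start = 4.0` are all hypothetical choices made for illustration, since the abstract does not specify the schedule shape or its parameters.

```python
def linear_ramp(epoch: int, total_epochs: int) -> float:
    """Fraction of training completed, clamped to [0, 1] (assumed linear)."""
    if total_epochs <= 1:
        return 1.0
    return min(max(epoch / (total_epochs - 1), 0.0), 1.0)


def saturation_at(epoch: int, total_epochs: int) -> float:
    """Color diet: saturation factor, 0.0 = grayscale, 1.0 = full color."""
    return linear_ramp(epoch, total_epochs)


def blur_sigma_at(epoch: int, total_epochs: int, sigma_start: float = 4.0) -> float:
    """Acuity diet: Gaussian blur sigma, shrinking from sigma_start to 0.

    sigma_start is a hypothetical value, not taken from the paper.
    """
    return sigma_start * (1.0 - linear_ramp(epoch, total_epochs))


if __name__ == "__main__":
    total = 101
    for epoch in (0, 50, 100):
        print(epoch, saturation_at(epoch, total), blur_sigma_at(epoch, total))
```

In a training loop, the two values would parameterize whatever saturation and blur transforms the data pipeline applies to each frame before the SSL augmentations; the temporal (T) constraint is handled separately by how frame pairs are sampled.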

Paper Structure

This paper contains 17 sections, 1 equation, 5 figures, 3 tables.

Figures (5)

  • Figure 1: Illustration of developmental visual diets and overview of evaluation benchmarks. The left panel depicts stages of infant development over time along with corresponding characteristics of visual perception. The 3 axes below represent the key regularities of infant visual development that underpin our work: Color, Acuity, and Temporality. The Color diet (CDiet) models the progression from color-degraded to richly chromatic scenes as color vision matures. The Acuity diet (ADiet) reflects the transition from blurry to sharp perception as visual resolution improves. The Temporality diet (TDiet) captures infants’ exposure to smoothly evolving visual scenes over short time windows. In the right panel, regular font indicates in-domain tasks, while italic font denotes out-of-domain tasks. Panels (a–d) assess object recognition: (a) clean image recognition on CO3D [reizenstein21co3d], SAYCam [orhan2020self], and ImageNet [ridnik2021in21k]; (b) corrupted image recognition across 15 corruption types following ImageNet-C [hendrycks2019benchmarking]; (c) shape-bias evaluation using the Texture–Shape Cue Conflict dataset [geirhos2018imagenet], testing whether classification aligns with shapes (green) or textures (red); and (d) silhouette recognition using the Silhouettes-Only dataset [geirhos2018imagenet]. Panels (e–f) evaluate depth perception: (e) judging whether the green arrow is closer than the red ball in the 3D-PC dataset [linsley20243d]; and (f) predicting which side (green arrow or red ball) appears closer from the infant egocentric perspective in the Visual Cliff paradigm [gibson1960viscliff].
  • Figure 2: Overview of our proposed developmental visual diets. CATDiet (panels a–c) integrates 3 individual visual diets detailed in \ref{sec:CAT}: (a) CDiet, in which image saturation gradually increases as chromatic information is progressively introduced throughout training; (b) ADiet, where the standard deviation $\sigma$ of Gaussian blur kernels decreases, enhancing spatial details over time; and (c) TDiet, which encourages representations of adjacent views of the same object to be closer, capturing temporal continuity in object-centric videos. (d) CombDiet extends CATDiet to a more general setting. In the first phase, CATDiet serves as a warm-up stage spanning the initial 30% of training epochs. In the second phase, CombDiet transitions to the Standard Diet (SDiet) while retaining Temporality-Diet (TDiet). SDiet corresponds to the standard data augmentation pipeline used in conventional SSL training regimes. We evaluate CombDiet using 2 representative SSL methods (SimCLR [chen2020simple] and DINO [caron2021emerging]) with 2 widely adopted backbones: ResNet [he2016deep] and ViT [dosovitskiy2020image].
  • Figure 3: Object recognition performance on CO3D (clean) and CO3D-C (corrupted) datasets for SimCLR-ResNet pretrained on CATDiet and its individual diets. Bars show mCE ($\downarrow$, left axis), where blue and gray bars denote our proposed visual diets and their corresponding baselines; the dashed red line indicates Acc ($\uparrow$, right axis). Error bars represent the standard error of mCE over three runs. The four panels correspond to different attributes of the proposed visual diets: Color, Acuity, Temporality, and their combination (see \ref{sec:ablation}).
  • Figure 4: Signature developmental patterns observed in SimCLR-ResNet pretrained on CATDiet from CO3D (a), 3D-PC (b), and IEVC (c). (a) The trace of the Fisher Information Matrix (FIM) for the SSL model gradients [achille2018critical] is plotted across pretraining epochs, capturing the sensitivity of network outputs to small weight perturbations. The inset shows synaptic density changes in macaque primary visual cortex (V1) [rakic1986concurrent], highlighting a similar rise-and-fall pattern for the model pretrained on CATDiet (blue) compared to FIM changes in CAT-SHF (red). (b) Binary classification accuracy on a depth-order task as a function of pretraining epochs. Blue and red curves correspond to CATDiet and SHF, respectively. The shaded region marks the period of rapid accuracy increase in dAcc for CATDiet. (c) Simulated Visual Cliff experiment. The top row shows egocentric views from an infant’s perspective crawling on a glass platform (see \ref{sec:datasets}). The table below summarizes model responses for CATDiet and CAT-SHF to the binary question "Is the green arrow closer than the red ball?" ("yes" indicates that the green arrow is closer).
  • Figure S1: Object recognition performance on CO3D (clean) and CO3D-C (corrupted) datasets for three SSL models pretrained on CATDiet and its individual diets. Panels (a–c) show results for SimCLR-ViT, DINO-ResNet, and DINO-ViT, respectively. Within each panel, bars show mCE ($\downarrow$, left axis) for our diets (blue) and their baselines (gray); the dashed red line indicates Acc ($\uparrow$, right axis). Error bars reflect the standard error of mCE over three runs. Within each panel, the four groups correspond to different attributes of the proposed visual diets: Color, Acuity, Temporality, and their combination.
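
The two-phase CombDiet schedule described in the Figure 2 caption (CATDiet for the initial 30% of training epochs, then SDiet with TDiet retained) can be sketched as a simple epoch-to-diet lookup. This is an illustrative sketch only: the function name `active_diets`, the string labels, and the rounding of the phase boundary to the nearest epoch are assumptions, not details taken from the paper.

```python
from typing import Tuple


def active_diets(epoch: int, total_epochs: int,
                 warmup_fraction: float = 0.3) -> Tuple[str, ...]:
    """Return the diets applied at a given (0-indexed) epoch.

    The 30% warm-up fraction comes from the Figure 2 caption; rounding the
    boundary to the nearest whole epoch is an assumption of this sketch.
    """
    warmup_epochs = round(total_epochs * warmup_fraction)
    if epoch < warmup_epochs:
        # Phase 1: CATDiet warm-up (color, acuity, and temporality diets).
        return ("CDiet", "ADiet", "TDiet")
    # Phase 2: standard augmentation pipeline, temporal continuity retained.
    return ("SDiet", "TDiet")


if __name__ == "__main__":
    for epoch in (0, 29, 30, 99):
        print(epoch, active_diets(epoch, total_epochs=100))
```

Note that TDiet appears in both phases, which matches the caption's statement that temporal continuity is preserved after the switch to standard training.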