Learning to See Through a Baby's Eyes: Early Visual Diets Enable Robust Visual Intelligence in Humans and Machines

Yusen Cai; Bhargava Satya Nunna; Qing Lin; Mengmi Zhang

TL;DR

This work proposes developmentally inspired visual curricula, CATDiet (Color-Diet, Acuity-Diet, Temporality-Diet) and CombDiet, that guide self-supervised learning (SSL) to emulate infant-like visual development. CATDiet improves robustness to image corruptions and yields signatures aligned with neural and behavioral observations (e.g., the emergence of depth perception and visual cliff-like responses) without any supervision from biological data. CombDiet extends this by initializing SSL with CATDiet before standard training, and it outperforms standard SSL on in-domain and out-of-domain object recognition and depth perception across multiple datasets and architectures. The ten-dataset benchmark indicates that staged, ecologically grounded visual experience fosters robust, generalizable visual intelligence in machines, with practical applicability to real-world vision tasks. All code, data, and models will be publicly released to facilitate replication and further exploration.

Abstract

Newborns perceive the world with low-acuity, color-degraded, and temporally continuous vision, which gradually sharpens as infants develop. To explore the ecological advantages of such staged "visual diets", we train self-supervised learning (SSL) models on object-centric videos under constraints that simulate infant vision: grayscale-to-color (C), blur-to-sharp (A), and preserved temporal continuity (T), collectively termed CATDiet. For evaluation, we establish a comprehensive benchmark across ten datasets, covering clean and corrupted image recognition, texture-shape cue conflict tests, silhouette recognition, depth-order classification, and the visual cliff paradigm. All CATDiet variants demonstrate enhanced robustness in object recognition, despite being trained solely on object-centric videos. Remarkably, models also exhibit biologically aligned developmental patterns, including neural plasticity changes mirroring synaptic density in macaque V1 and behaviors resembling infants' visual cliff responses. Building on these insights, CombDiet initializes SSL with CATDiet before standard training while preserving temporal continuity. Trained on object-centric or head-mounted infant videos, CombDiet outperforms standard SSL on both in-domain and out-of-domain object recognition and depth perception. Together, these results suggest that the developmental progression of early infant visual experience offers a powerful reverse-engineering framework for understanding the emergence of robust visual intelligence in machines. All code, data, and models will be publicly released.
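As a rough illustration of the grayscale-to-color (C) and blur-to-sharp (A) constraints, the sketch below maps training progress to a color saturation and a Gaussian blur strength, then applies both to a single video frame. It is not the authors' implementation: the linear schedule, the `max_sigma` value, and the function names `diet_params` and `apply_diet` are assumptions made for illustration, and the abstract does not specify the actual schedules. Temporal continuity (T) is not modeled here; it would correspond to feeding frames in their original order rather than shuffling them.

```python
import numpy as np

def diet_params(progress, max_sigma=4.0):
    """Map training progress in [0, 1] to (saturation, blur sigma).

    Assumption: a linear schedule; the paper's schedule may differ.
    """
    p = float(np.clip(progress, 0.0, 1.0))
    saturation = p                 # 0 = grayscale, 1 = full color
    sigma = max_sigma * (1.0 - p)  # strong blur early, none at the end
    return saturation, sigma

def desaturate(image, saturation):
    """Blend an RGB image (H, W, 3) with its luminance."""
    gray = image @ np.array([0.299, 0.587, 0.114])
    return saturation * image + (1.0 - saturation) * gray[..., None]

def gaussian_kernel(sigma):
    """Normalized 1-D Gaussian kernel truncated at 3 sigma."""
    radius = max(1, int(np.ceil(3 * sigma)))
    x = np.arange(-radius, radius + 1)
    k = np.exp(-(x ** 2) / (2 * sigma ** 2))
    return k / k.sum()

def blur(image, sigma):
    """Separable Gaussian blur with edge padding; no-op when sigma is ~0."""
    if sigma <= 1e-6:
        return image
    k = gaussian_kernel(sigma)
    r = len(k) // 2
    padded = np.pad(image, ((r, r), (r, r), (0, 0)), mode="edge")
    # Convolve along rows (axis 0), then along columns (axis 1).
    out = np.apply_along_axis(lambda v: np.convolve(v, k, mode="valid"), 0, padded)
    out = np.apply_along_axis(lambda v: np.convolve(v, k, mode="valid"), 1, out)
    return out

def apply_diet(frame, progress):
    """Apply the hypothetical color and acuity diets to one RGB frame."""
    saturation, sigma = diet_params(progress)
    return blur(desaturate(frame, saturation), sigma)

# Example: the same frame at the start and at the end of the curriculum.
frame = np.random.default_rng(0).random((8, 8, 3))
early = apply_diet(frame, 0.0)  # grayscale and heavily blurred
late = apply_diet(frame, 1.0)   # unchanged: full color, no blur
```

Under these assumptions, a frame at progress 0 is fully desaturated and blurred, while a frame at progress 1 passes through untouched, so the curriculum ends in ordinary training inputs.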
