BabyVLM: Data-Efficient Pretraining of VLMs Inspired by Infant Learning
Shengao Wang, Arjun Chandra, Aoming Liu, Venkatesh Saligrama, Boqing Gong
TL;DR
BabyVLM presents a developmentally inspired, data-efficient approach to pretraining vision-language models by combining a filtered SAYCam corpus with a synthetic child-directed dataset and a compact generative baseline. The framework introduces in-domain benchmarks that reflect early cognitive milestones and demonstrates that carefully curated, developmentally aligned data can yield robust baby-like representations with markedly improved data efficiency over general-purpose training. Ablation and analysis reveal that while synthetic data improves compositional reasoning and in-domain performance, generative models face challenges in full-sentence generation and broader generalization, highlighting the importance of data design and task alignment. Overall, BabyVLM offers a principled template for resource-efficient multimodal learning and provides actionable insights for aligning AI systems with developmental learning processes.
Abstract
Human infants rapidly develop visual reasoning skills from minimal input, suggesting that developmentally inspired pretraining could significantly enhance the efficiency of vision-language models (VLMs). Although recent efforts have leveraged infant-inspired datasets like SAYCam, existing evaluation benchmarks remain misaligned: they are too simplistic, too narrowly scoped, or tailored for large-scale pretrained models. Additionally, training exclusively on infant data overlooks the broader, diverse input from which infants naturally learn. To address these limitations, we propose BabyVLM, a novel framework comprising comprehensive in-domain evaluation benchmarks and a synthetic training dataset created via child-directed transformations of existing datasets. We demonstrate that VLMs trained with our synthetic dataset achieve superior performance on BabyVLM tasks compared to models trained solely on SAYCam or on general-purpose data of the same size as SAYCam. BabyVLM thus provides a robust, developmentally aligned evaluation tool and illustrates how compact models trained on carefully curated data can generalize effectively, opening pathways toward data-efficient vision-language learning paradigms.
