BabyVLM: Data-Efficient Pretraining of VLMs Inspired by Infant Learning

Shengao Wang, Arjun Chandra, Aoming Liu, Venkatesh Saligrama, Boqing Gong

TL;DR

BabyVLM presents a developmentally inspired, data-efficient approach to pretraining vision-language models by combining a filtered SAYCam corpus with a synthetic child-directed dataset and a compact generative baseline. The framework introduces in-domain benchmarks that reflect early cognitive milestones and demonstrates that carefully curated, developmentally aligned data can yield robust baby-like representations with markedly improved data efficiency over general-purpose training. Ablation and analysis reveal that while synthetic data improves compositional reasoning and in-domain performance, generative models face challenges in full-sentence generation and broader generalization, highlighting the importance of data design and task alignment. Overall, BabyVLM offers a principled template for resource-efficient multimodal learning and provides actionable insights for aligning AI systems with developmental learning processes.

Abstract

Human infants rapidly develop visual reasoning skills from minimal input, suggesting that developmentally inspired pretraining could significantly enhance the efficiency of vision-language models (VLMs). Although recent efforts have leveraged infant-inspired datasets like SAYCam, existing evaluation benchmarks remain misaligned: they are too simplistic, too narrowly scoped, or tailored for large-scale pretrained models. Additionally, training exclusively on infant data overlooks the broader, diverse input from which infants naturally learn. To address these limitations, we propose BabyVLM, a novel framework comprising comprehensive in-domain evaluation benchmarks and a synthetic training dataset created via child-directed transformations of existing datasets. We demonstrate that VLMs trained with our synthetic dataset achieve superior performance on BabyVLM tasks compared to models trained solely on SAYCam or on general-purpose data of the same size as SAYCam. BabyVLM thus provides a robust, developmentally aligned evaluation tool and illustrates how compact models trained on carefully curated data can generalize effectively, opening pathways toward data-efficient vision-language learning paradigms.

Paper Structure

This paper contains 23 sections, 12 figures, and 8 tables.

Figures (12)

  • Figure 1: We introduce BabyVLM, a developmentally inspired framework derived from SAYCam, consisting of the original SAYCam dataset [sullivan2021saycam], a transferred training dataset, a generative baseline VLM, and four evaluation benchmarks.
  • Figure 2: Pipeline for generating the transferred dataset. Step 1: We prompt GPT-4o to check whether an input caption is describing something a child would see in daily life and transfer the original image captions into simpler, child-directed utterances. Step 2: We use the CLIP similarity score as a metric to represent the distance between two images, and then conduct Hungarian matching to select a small subset of the transferred dataset that is visually aligned with SAYCam images.
  • Figure 3: Illustrations of in-domain evaluation benchmarks in the BabyVLM framework. Labeled-S: The category label must be matched to the target referent among 4 candidates. Visual Two-Word Test: The positive phrase must be matched to the image. Positive and negative phrases are generated by GPT-4o. Baby Winoground: The positive and negative phrases must be matched with their corresponding images. Negative images are generated by Stable Diffusion [stablediffusion3], with prompts enhanced by GPT-4o. SAYCam Caption: The generated image caption must match the ground truth image caption. All image-caption pairs come from a de-duplicated subset of the SAYCam test split.
  • Figure 4: Examples of the filtered SAYCam dataset
  • Figure 5: Full prompt for transferred dataset creation
  • ...and 7 more figures
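
Step 2 of the Figure 2 pipeline (similarity scoring followed by Hungarian matching) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `select_aligned_subset`, the one-to-one matching of each SAYCam image to a single candidate, and the toy vectors standing in for CLIP image embeddings are all assumptions made for illustration. The sketch uses SciPy's `linear_sum_assignment` to solve the assignment problem.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def select_aligned_subset(target_emb, candidate_emb):
    """Pick one distinct candidate per target, maximizing total cosine similarity.

    target_emb:    (n_targets, d) array, e.g. embeddings of SAYCam images (assumed).
    candidate_emb: (n_candidates, d) array, e.g. embeddings of transferred-dataset
                   images (assumed), with n_candidates >= n_targets.
    Returns the selected candidate indices and their similarity scores.
    """
    # L2-normalize rows so that a dot product equals cosine similarity.
    t = target_emb / np.linalg.norm(target_emb, axis=1, keepdims=True)
    c = candidate_emb / np.linalg.norm(candidate_emb, axis=1, keepdims=True)
    sim = t @ c.T  # shape: (n_targets, n_candidates)

    # Hungarian-style assignment: each target gets a different candidate.
    rows, cols = linear_sum_assignment(sim, maximize=True)
    return cols, sim[rows, cols]


# Toy example: three "target" vectors and four "candidates", three of which
# are exact copies of the targets in shuffled order.
targets = np.eye(3)
candidates = np.array([
    [0.0, 0.0, 1.0],  # copy of target 2
    [1.0, 0.0, 0.0],  # copy of target 0
    [1.0, 1.0, 1.0],  # unrelated distractor
    [0.0, 1.0, 0.0],  # copy of target 1
])
selected, scores = select_aligned_subset(targets, candidates)
print(selected.tolist())  # indices of the chosen candidates, one per target
```

In this toy case the matching recovers the three copies and leaves the distractor unselected. With real data, the embeddings would come from a CLIP image encoder, and the subset size and matching direction would follow whatever the paper specifies.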