On the robustness of modeling grounded word learning through a child's egocentric input

Wai Keen Vong, Brenden M. Lake

TL;DR

Networks trained on automatically transcribed data from each child in the SAYCam dataset can acquire word-referent mappings that generalize across videos, children, and image domains, supporting the robustness of multimodal neural networks for grounded word learning.

Abstract

What insights can machine learning bring to understanding human language acquisition? Large language and multimodal models have achieved remarkable capabilities, but their reliance on massive training datasets creates a fundamental mismatch with children, who succeed in acquiring language from comparatively limited input. To help bridge this gap, researchers have increasingly trained neural networks using data similar in quantity and quality to children's input. Taking this approach to the limit, Vong et al. (2024) showed that a multimodal neural network trained on 61 hours of visual and linguistic input extracted from just one child's developmental experience could acquire word-referent mappings. However, whether this approach's success reflects the idiosyncrasies of a single child's experience, or whether it would show consistent and robust learning patterns across multiple children's experiences was not explored. In this article, we applied automated speech transcription methods to the entirety of the SAYCam dataset, consisting of over 500 hours of video data spread across all three children. Using these automated transcriptions, we generated multi-modal vision-and-language datasets for both training and evaluation, and explored a range of neural network configurations to examine the robustness of simulated word learning. Our findings demonstrate that networks trained on automatically transcribed data from each child can acquire word-referent mappings, generalizing across videos, children, and image domains. These results validate the robustness of multimodal neural networks for grounded word learning, while highlighting the individual differences that emerge in how models learn when trained on each child's developmental experiences.

Paper Structure

This paper contains 25 sections, 4 equations, 10 figures, 4 tables.

Figures (10)

  • Figure 1: Evaluation frames from Labeled-S-V2, Labeled-A and Labeled-Y. Here we present three randomly selected evaluation frames from four different evaluation categories (ball, toy, book and cereal), with each row indicating frames derived from each of the three children in the SAYCam dataset.
  • Figure 2: Model architectures. We explore three different multi-modal architectures. The first two architectures utilize a contrastive loss with either a simple Embedding layer with averaging across words (CVCL), or a 2-layer Transformer Decoder (CVCL+T). The third is also a 2-layer Transformer Decoder, but incorporates an additional language modeling loss head (CVCL+T+LM), with an additional weight parameter to balance the two losses. All models use a pre-trained Vision Transformer as their vision encoder, trained only from the visual data from each child, along with a learned, absolute positional embedding scheme in the language encoder.
  • Figure 3: Visualization of the contrastive objective. Given multiple pairs of image frames and utterances, we obtain embeddings via their respective encoders. The contrastive objective aims to maximize the cosine similarity of these matched embeddings (shown in green), while also minimizing the cosine similarity of mismatched embeddings (shown in red).
  • Figure 4: Labeled-S Classification Performance between CVCL trained with manual transcriptions (S-Manual-2022) vs. trained with Whisper transcriptions (S-Whisper-2022). Note that the version of CVCL in the right plot utilizes a pre-trained vision transformer (ViT) instead of the CNN (ResNeXt), as well as a positional encoding scheme in the language encoder. Models were trained with three random seeds. Error bars show bootstrapped 95% confidence intervals over category-level accuracies.
  • Figure 5: Labeled-S Classification Performance between CVCL trained on different combinations of Whisper transcriptions. The left-most bar (S-Whisper-2022, same as in Figure 4) represents classification performance when trained on transcriptions from the set of previously manually transcribed videos. The middle bar (S-Whisper-Disjoint) represents performance when trained only using transcriptions from new and non-overlapping videos from baby S. Finally, the right-most bar (S-Whisper) represents performance when trained using all of the combined Whisper-transcribed data from original and new videos. Models were trained with three random seeds. Error bars show bootstrapped 95% confidence intervals over category-level accuracies.
  • ...and 5 more figures
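
The contrastive objective described in the Figure 3 caption can be illustrated with a small sketch. This is not the authors' implementation: it assumes a symmetric, temperature-scaled cross-entropy over cosine similarities (a common way to realize such an objective), and the function names, the `temperature` value, and the toy embeddings are all hypothetical. It uses only the Python standard library so that it runs as written.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cosine_similarity_matrix(image_embs, text_embs):
    """Entry [i][j] is the cosine similarity of image i and utterance j."""
    images = [l2_normalize(v) for v in image_embs]
    texts = [l2_normalize(v) for v in text_embs]
    return [[sum(a * b for a, b in zip(img, txt)) for txt in texts]
            for img in images]


def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct column for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        row_max = max(row)  # subtract the max for numerical stability
        log_sum_exp = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss (assumed form; temperature is illustrative).

    Matched pairs sit on the diagonal of the similarity matrix. Lowering the
    loss raises their similarity relative to the mismatched off-diagonal pairs.
    """
    sims = cosine_similarity_matrix(image_embs, text_embs)
    logits = [[s / temperature for s in row] for row in sims]
    logits_transposed = [list(col) for col in zip(*logits)]
    image_to_text = cross_entropy_diagonal(logits)
    text_to_image = cross_entropy_diagonal(logits_transposed)
    return 0.5 * (image_to_text + text_to_image)


# Toy batch of two frame/utterance pairs (made-up 2-d embeddings).
images = [[1.0, 0.0], [0.0, 1.0]]
aligned_texts = [[0.9, 0.1], [0.1, 0.9]]   # each utterance near its own frame
shuffled_texts = [[0.1, 0.9], [0.9, 0.1]]  # each utterance near the other frame

loss_aligned = contrastive_loss(images, aligned_texts)
loss_shuffled = contrastive_loss(images, shuffled_texts)
print(f"aligned: {loss_aligned:.6f}  shuffled: {loss_shuffled:.6f}")
```

In this toy batch the aligned pairing yields a much lower loss than the shuffled one, which is the behavior the green (matched) and red (mismatched) cells in Figure 3 depict.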