On the robustness of modeling grounded word learning through a child's egocentric input

Wai Keen Vong; Brenden M. Lake

TL;DR

Networks trained on automatically transcribed data from each child can acquire word-referent mappings that generalize across videos, children, and image domains, validating the robustness of multimodal neural networks for grounded word learning.

Abstract

What insights can machine learning bring to understanding human language acquisition? Large language and multimodal models have achieved remarkable capabilities, but their reliance on massive training datasets creates a fundamental mismatch with children, who succeed in acquiring language from comparatively limited input. To help bridge this gap, researchers have increasingly trained neural networks using data similar in quantity and quality to children's input. Taking this approach to the limit, Vong et al. (2024) showed that a multimodal neural network trained on 61 hours of visual and linguistic input extracted from just one child's developmental experience could acquire word-referent mappings. However, it remained unexplored whether this success reflects the idiosyncrasies of a single child's experience or whether the approach would show consistent and robust learning patterns across multiple children's experiences. In this article, we applied automated speech transcription methods to the entirety of the SAYCam dataset, consisting of over 500 hours of video data spread across all three children. Using these automated transcriptions, we generated multimodal vision-and-language datasets for both training and evaluation, and explored a range of neural network configurations to examine the robustness of simulated word learning. Our findings demonstrate that networks trained on automatically transcribed data from each child can acquire word-referent mappings, generalizing across videos, children, and image domains. These results validate the robustness of multimodal neural networks for grounded word learning, while highlighting the individual differences that emerge in how models learn when trained on each child's developmental experiences.
