A model of early word acquisition based on realistic-scale audiovisual naming events

Khazar Khorrami, Okko Räsänen

TL;DR

This work tests whether infants can acquire early word perception and referential mappings from realistic, unannotated audiovisual input, using a staged, infant-scale simulation. A VG-W2V2-based architecture first learns from self-supervised audio and visual streams and then from cross-modal associations, mirroring a developmental timeline from 0 to 12 months. Results show reliable phoneme discrimination and progressive learning of word forms and meanings, with vocabulary growth that aligns with receptive lexicons reported for infants under plausible naming rates. The findings support a general statistical learning account of early word comprehension and demonstrate learnability from real-world data without strong linguistic priors, while acknowledging limitations related to speech type, context, and social interaction.

Abstract

Infants gradually learn to parse continuous speech into words and connect names with objects, yet the mechanisms behind development of early word perception skills remain unknown. We studied the extent to which early words can be acquired through statistical learning from regularities in audiovisual sensory input. We simulated word learning in infants up to 12 months of age in a realistic setting, using a model that solely learns from statistical regularities in unannotated raw speech and pixel-level visual input. Crucially, the quantity of object naming events was carefully designed to match that accessible to infants of comparable ages. Results show that the model effectively learns to recognize words and associate them with corresponding visual objects, with a vocabulary growth rate comparable to that observed in infants. The findings support the viability of general statistical learning for early word perception, demonstrating how learning can operate without assuming any prior linguistic capabilities.

Paper Structure (23 sections, 1 equation, 6 figures, 3 tables)

Figures (6)

  • Figure 1: An illustration of the language learning challenges encountered in realistic settings. Speech is heard in a particular visual context, and the learner has to learn to identify words from running speech, to extract visual objects from the scenes, and then to relate correct words with correct objects by accumulating evidence across multiple individually ambiguous naming events and scenes. This is complicated by the fact that there are no universal cues to word boundaries and each spoken realization of a word has a different acoustic waveform. Utterances also consist of many words that can refer to many different objects in the visual scene, and only some of the spoken words (highlighted with colors) may be relevant to the current scene. Moreover, only some aspects of the images are named in the utterances, and visual objects of a category look different across situations. Finally, one word can refer to many entities of the same visual category (e.g., "cake" in the left panel), and several words can refer to the same physical entity ("bottle" and "milk" for a "bottle of milk" in the left panel). Images and waveforms were extracted from the SpokenCOCO dataset (hsu2021text).
  • Figure 2: Modeling approach. A, Illustration of the modeled timeline in terms of simulated infant months and total number of referentially relevant naming events (across all word types). B, Illustration of the frequencies at which visual objects and words co-occur in the audiovisual input, with statistics derived from (clerkin2019everyday; clerkin2022real) but adapted to the visual object categories of the COCO dataset (lin2014microsoft). Examples of the highest-, medium-, and lowest-frequency words are shown. The sets of spoken words used to refer to each of the visual categories are shown in parentheses. C, High-level overview of the computational model. Visual and auditory inputs are initially processed in separate encoder networks that first learn in a self-supervised manner without data labels or multimodal influence. After 6 months, cross-modal learning starts to operate in an audiovisual associative network using concurrent sights and sounds, together with continued learning in the auditory encoder.
  • Figure 3: Word learning competence in a simulated infant aged 0-12 months. a, Word discrimination and word meaning learning for different trained models. b, c, The distribution of word meaning scores over various object categories.
  • Figure 4: Vocabulary size of the model and real infants in terms of word comprehension. Black line denotes median vocabulary size in the infant population and grey circles show vocabulary sizes of individual infants (data from CDI vocabulary norms of North American English-learning infants; frank2017wordbank). Three different vocabulary size curves are shown for the model, depending on how accurately the model needs to identify the correct visual referent of a spoken word in the two-alternative forced-choice task used for word meaning scores. Blue line: above-chance identification of the correct referent (upper bound for vocabulary size). Red line: correct in 2/3 of the trials. Yellow line: correct in 4/5 of the trials (lower bound).
  • Figure 5: The training and validation loss curves of the wav2vec 2.0 speech processing model ($loss_{\text{AUD}}$) over 100 epochs for the 6-month age bin.
  • ...and 1 more figure
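
The three vocabulary-size criteria described in the Figure 4 caption can be illustrated with a minimal sketch. It assumes that each word has a single accuracy score from the two-alternative forced-choice task and that "above chance" can be approximated by a plain cutoff of accuracy > 0.5; the paper may use a different criterion (for example a statistical test). The names `THRESHOLDS`, `vocabulary_size`, and `vocabulary_curves`, as well as the example scores, are hypothetical and are not taken from the paper.

```python
from typing import Dict

# Hypothetical cutoffs mirroring the three criteria named in the Figure 4 caption.
# Treating "above chance" as accuracy > 0.5 is an assumption made for illustration.
THRESHOLDS: Dict[str, float] = {
    "above_chance": 0.5,   # blue line: upper bound for vocabulary size
    "two_thirds": 2 / 3,   # red line
    "four_fifths": 4 / 5,  # yellow line: lower bound
}


def vocabulary_size(scores: Dict[str, float], threshold: float, strict: bool = False) -> int:
    """Count words whose 2AFC accuracy meets the given criterion.

    With strict=True the accuracy must exceed the threshold;
    otherwise it must be at least the threshold.
    """
    if strict:
        return sum(1 for acc in scores.values() if acc > threshold)
    return sum(1 for acc in scores.values() if acc >= threshold)


def vocabulary_curves(scores: Dict[str, float]) -> Dict[str, int]:
    """Return the vocabulary size under each of the three criteria."""
    return {
        name: vocabulary_size(scores, thr, strict=(name == "above_chance"))
        for name, thr in THRESHOLDS.items()
    }


# Made-up per-word accuracies for one simulated age bin (not results from the paper).
example_scores = {"dog": 0.92, "cake": 0.81, "bottle": 0.70, "chair": 0.55, "spoon": 0.48}
curves = vocabulary_curves(example_scores)
print(curves)
```

A stricter criterion can only shrink the counted vocabulary, which is why the caption treats the above-chance curve as an upper bound and the 4/5 curve as a lower bound.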