A model of early word acquisition based on realistic-scale audiovisual naming events
Khazar Khorrami, Okko Räsänen
TL;DR
This work tests whether early word perception and referential mappings can be acquired from realistic, unannotated audiovisual input, using a staged, infant-scale simulation. A VG-W2V2-based architecture first learns from the audio and visual streams in a self-supervised manner, then from cross-modal associations, mirroring the developmental timeline from 0 to 12 months. Results show reliable phoneme discrimination and progressive learning of word forms and meanings, with vocabulary growth aligning with the receptive lexicons reported for infants under plausible naming rates. The findings support a general statistical learning account of early word comprehension and demonstrate learnability from real-world data without strong linguistic priors, while acknowledging limitations related to speech type, context, and social interaction.
Abstract
Infants gradually learn to parse continuous speech into words and connect names with objects, yet the mechanisms behind the development of early word perception skills remain unknown. We studied the extent to which early words can be acquired through statistical learning from regularities in audiovisual sensory input. We simulated word learning in infants up to 12 months of age in a realistic setting, using a model that learns solely from statistical regularities in unannotated raw speech and pixel-level visual input. Crucially, the quantity of object naming events was carefully designed to match that accessible to infants of comparable ages. Results show that the model effectively learns to recognize words and associate them with corresponding visual objects, with a vocabulary growth rate comparable to that observed in infants. The findings support the viability of general statistical learning for early word perception, demonstrating how learning can operate without assuming any prior linguistic capabilities.
