Assessing the alignment between infants' visual and linguistic experience using multimodal language models

Alvin Wei Ming Tan, Jane Yang, Tarun Sepuri, Khai Loong Aw, Robert Z. Sparks, Zi Yin, Virginia A. Marchman, Michael C. Frank, Bria Long

TL;DR

This work investigates how temporally aligned infants' visual and linguistic experiences are during natural learning by using CLIP-based alignment to quantify frame–utterance congruence in egocentric BabyView videos. After validating CLIP alignment against human judgments, the authors apply it at scale to reveal that highly aligned moments are relatively infrequent and vary across individuals, contexts, and utterance content. They find that adult-produced speech, longer utterances, and lemmas that are frequent and concrete tend to yield higher alignment, offering insights into factors shaping early word learning in real-world environments. The study introduces a scalable methodology for probing multimodal learning contexts in development and highlights implications for modeling vocabulary acquisition under sparse and context-dependent visual–linguistic input.

Abstract

Figuring out which objects or concepts words refer to is a central language learning challenge for young children. Most models of this process posit that children learn early object labels from co-occurrences of words and their referents that occur when someone around them talks about an object in the immediate physical environment. But how aligned in time are children's visual and linguistic experiences during everyday learning? To date, answers to this question have been limited by the need for labor-intensive manual annotations of vision-language co-occurrences. Here, we evaluate the use of contrastive language-image pretraining (CLIP) models to automatically characterize vision-language alignment in egocentric videos taken from the infant perspective in home environments. After validating CLIP alignment scores using human alignment judgments, we apply this metric to a large corpus of infant-perspective videos. We show that idealized aligned moments for learning (e.g., "look at the ball" with a ball present in the child's view) are relatively rare in children's everyday experiences compared to modern machine learning datasets, and highlight variability in alignment both within and across children. These findings suggest that infrequent alignment is a constraint for models describing early word learning and offer a new method for investigating children's multimodal environment.

Paper Structure

This paper contains 15 sections, 7 figures.

Figures (7)

  • Figure 1: Overview of the pipeline for CLIP alignment scores, showing example frames from a BabyView camera in a naturalistic home environment; CLIP alignment scores are calculated by taking the dot product of the normed embeddings from the CLIP text and vision encoders for each frame associated with an utterance (as transcribed by Distil-Whisper). We then take the maximum alignment score (visualized in purple) for each utterance.
  • Figure 2: Human 4AFC accuracy by condition as a function of CLIP alignment score. Error bars indicate bootstrapped 95% confidence intervals by CLIP score decile, and lines indicate best-fit logistic curves along with their 95% confidence band. Dashed line indicates chance-level performance.
  • Figure 3: Human 4AFC accuracy as a function of CLIP alignment score, for the 8 most frequent activities. Lines indicate best-fit logistic curves. Dashed line indicates chance-level performance.
  • Figure 4: CLIP 4AFC accuracy as a function of CLIP alignment score. Error bars indicate bootstrapped 95% confidence intervals by CLIP score decile, and lines indicate best-fit logistic curves along with their 95% confidence band. Dashed colored lines reflect human 4AFC accuracies (as in Figure 2). Dashed grey line indicates chance-level performance.
  • Figure 5: Proportion of highly-aligned adult utterances as a function of the child's age. Colors reflect individual children, and lines reflect best-fit LOESS curves; the size of the dots scales with the number of utterances in each age bin.
  • ...and 2 more figures
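
The scoring step described in the Figure 1 caption (dot product of normed text and image embeddings, then the maximum over the frames associated with an utterance) can be sketched as follows. This is a minimal illustration and not the authors' code: the embeddings are assumed to have already been produced by some CLIP text and vision encoder, and the function names `l2_normalize`, `alignment_score`, and `utterance_alignment` are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def l2_normalize(v: Vector) -> List[float]:
    """Scale a vector to unit length (the 'normed' embedding in the caption)."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def alignment_score(text_emb: Vector, frame_emb: Vector) -> float:
    """Dot product of the normed text and frame embeddings (cosine similarity)."""
    if len(text_emb) != len(frame_emb):
        raise ValueError("embeddings must have the same dimensionality")
    t = l2_normalize(text_emb)
    f = l2_normalize(frame_emb)
    return sum(a * b for a, b in zip(t, f))


def utterance_alignment(
    text_emb: Vector, frame_embs: Sequence[Vector]
) -> Tuple[float, int]:
    """Return the maximum alignment score over an utterance's frames,
    together with the index of the frame that achieved it."""
    if not frame_embs:
        raise ValueError("an utterance needs at least one associated frame")
    scores = [alignment_score(text_emb, f) for f in frame_embs]
    best = max(range(len(scores)), key=scores.__getitem__)
    return scores[best], best


# Toy 2-D vectors standing in for real encoder outputs (assumption).
utterance = [1.0, 0.0]
frames = [[0.0, 1.0], [1.0, 1.0], [2.0, 0.0]]
score, frame_index = utterance_alignment(utterance, frames)
```

Taking the maximum rather than the mean means a single well-matched frame is enough for an utterance to count as aligned, which matches the per-utterance summary described in the caption.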