Caregiver Talk Shapes Toddler Vision: A Computational Study of Dyadic Play

Timothy Schaumlöffel, Arthur Aubret, Gemma Roig, Jochen Triesch

TL;DR

This work investigates how caregiver naming during dyadic play shapes toddlers' visual representations. It introduces the Dyadic Play Dataset and a self-supervised learning framework that combines contrastive learning through time with multimodal alignment to language, using developmentally relevant utterance statistics. Results show that plausible naming statistics improve category- and instance-level object representations, with performance highly sensitive to small changes in naming frequency and ambiguity. A key finding is that attention to object names within utterances is crucial for effective visuo-linguistic alignment, supporting the idea that early language guides visual learning in development.

Abstract

Infants' ability to recognize and categorize objects develops gradually. The second year of life is marked by both the emergence of more semantic visual representations and a better understanding of word meaning. This suggests that language input may play an important role in shaping visual representations. However, even in suitable contexts for word learning like dyadic play sessions, caregivers' utterances are sparse and ambiguous, often referring to objects that are different from the one to which the child attends. Here, we systematically investigate to what extent caregivers' utterances can nevertheless enhance visual representations. To this end, we propose a computational model of visual representation learning during dyadic play. We introduce a synthetic dataset of ego-centric images perceived by a toddler-agent that moves and rotates toy objects in different parts of its home environment while hearing caregivers' utterances, modeled as captions. We propose to model toddlers' learning as simultaneously aligning representations for 1) close-in-time images and 2) co-occurring images and utterances. We show that utterances with statistics matching those of real caregivers give rise to representations supporting improved category recognition. Our analysis reveals that a small decrease or increase in object-relevant naming frequencies can drastically impact the learned representations. This affects the attention on object names within an utterance, which is required for efficient visuo-linguistic alignment. Overall, our results support the hypothesis that caregivers' naming utterances can improve toddlers' visual representations.
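
The two alignment objectives named in the abstract can be illustrated with a minimal NumPy sketch. It assumes both terms are InfoNCE-style contrastive losses over a batch of embeddings and that they are simply summed; the function names, the `weight` and `temperature` parameters, and the one-directional form of the loss are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    # Scale each row to unit length so dot products are cosine similarities.
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)

def info_nce(anchors, positives, temperature=0.1):
    # Row i of `anchors` is pulled towards row i of `positives`;
    # all other rows in the batch act as negatives.
    a = l2_normalize(anchors)
    p = l2_normalize(positives)
    logits = a @ p.T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))

def combined_loss(z_img_t, z_img_next, z_text, weight=1.0, temperature=0.1):
    # Assumed combination: temporal term (close-in-time images)
    # plus a weighted multimodal term (co-occurring image and utterance).
    temporal = info_nce(z_img_t, z_img_next, temperature)
    multimodal = info_nce(z_img_t, z_text, temperature)
    return temporal + weight * multimodal
```

In this sketch, `z_img_t` and `z_img_next` stand for embeddings of consecutive frames and `z_text` for embeddings of the utterances heard with `z_img_t`. Perfectly aligned inputs yield a loss near zero, while unrelated inputs yield a loss close to twice the log of the batch size.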

Paper Structure (21 sections, 3 equations, 6 figures)

Figures (6)

  • Figure 1: A) Top view of the Virtual Home Environment, where blue and red dots, respectively, indicate possible agent and toy positions. The agent is always turned towards a toy position. Toy positions marked 1--3 correspond to sessions 1--3 in C. B) Zoom-in of the scene in A with turquoise lines indicating the agent's field of view. C) Images extracted in a temporally ordered fashion (left to right) for three different "play" sessions of the Dyadic Play Dataset. Text boxes show examples of captions related to the manipulated object (white) or to another object in the background (red). D) Summary of the learning architecture, see Section \ref{sec:ssl} for details. Abbreviations: MLP: multi-layer perceptron, MMCL: multimodal contrastive learning, CLTT: contrastive learning through time.
  • Figure 2: A) Category recognition accuracy and B) object instance recognition accuracy for different settings. Oracle represents supervised learning, while an Ideal caregiver consistently names the correct object. Plausible stands for developmentally-relevant utterance statistics, Plausible* is identical but trains the text encoder from scratch.
  • Figure 3: t-SNE visualization of the feature representations extracted by the vision-encoder in different training settings. For better visualization, we show a random subgroup of all classes.
  • Figure 4: Analysis of the impact of the sparsity parameters on A) category recognition and B) object instance recognition. The red points indicate the developmentally-relevant value.
  • Figure 5: Analysis of the impact of naming ambiguity on A) category recognition and B) instance recognition. A high $p_{\rm correct}$ implies low ambiguity. The red points indicate the developmentally-relevant value.
  • ...and 1 more figure
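
The sparsity and ambiguity parameters mentioned in the captions of Figures 4 and 5 can be pictured with a small sampling sketch. It assumes that sparsity is a per-step probability that the caregiver names any object (called `p_name` here, a hypothetical name) and that $p_{\rm correct}$ is the probability that a naming utterance refers to the manipulated object rather than a background object. The function and its arguments are illustrative and do not reproduce the dataset's actual caption generator.

```python
import random

def sample_named_object(target, distractors, p_name, p_correct, rng):
    """Return the object named at this time step, or None if nothing is named."""
    # Sparsity (assumed): most time steps carry no object name at all.
    if rng.random() >= p_name:
        return None
    # Ambiguity (assumed): a naming utterance refers to the manipulated
    # object with probability p_correct, otherwise to a background object.
    if not distractors or rng.random() < p_correct:
        return target
    return rng.choice(distractors)

# Usage with made-up object names and parameter values.
rng = random.Random(0)
names = [sample_named_object("car", ["ball", "cup"], 0.3, 0.7, rng) for _ in range(10)]
```

Under these assumptions, raising `p_name` makes naming less sparse and raising `p_correct` makes it less ambiguous, which are the two axes varied in Figures 4 and 5.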