Do Blind Spots Matter for Word-Referent Mapping? A Computational Study with Infant Egocentric Video

Zekai Shi, Zhixi Cai, Kalin Stefanov

TL;DR

This work tackles learning word–referent mappings from infant-like, egocentric video using a self-supervised, biologically plausible masking approach. By integrating a blind-spot–based masking strategy into a VideoMAEv2 backbone and training with a contrastive video–text objective, the authors demonstrate learnability of cross-modal word–referent associations from temporally extended episodes. They curate a developmentally relevant Video Labeled-S dataset and evaluate across object classification, spatio-temporal understanding, and CVCL-style retrieval, showing that ecologically valid masking can match or exceed traditional random masking in several settings. The findings suggest that temporally continuous, egocentric experience, combined with biologically informed masking, yields robust word–referent mappings and spatio-temporal understanding with practical implications for developmental learning models and multimodal vision-language systems.

Abstract

Typically, children start to learn their first words between 6 and 9 months, linking spoken utterances to their visual referents. Without prior knowledge, a word encountered for the first time can be interpreted in countless ways; it might refer to any of the objects in the environment, their components, or attributes. In this work, we use longitudinal, egocentric, and ecologically valid data from the experience of one child to propose a self-supervised and biologically plausible strategy for learning strong visual representations. Our masked autoencoder-based visual backbone incorporates knowledge about the blind spot in human eyes to define a novel masking strategy. This mask-and-reconstruct approach attempts to mimic the way the human brain fills the gaps in the eyes' field of view. It represents a significant shift from standard random masking strategies, which are difficult to justify from a biological perspective. The pretrained encoder is then used in a contrastive learning-based video-text model capable of acquiring word-referent mappings. Extensive evaluation suggests that the proposed biologically plausible masking strategy is at least as effective as random masking for learning word-referent mappings from cross-situational and temporally extended episodes.
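
To make the contrastive video-text objective mentioned in the abstract more concrete, the following is a minimal sketch of one common form of such a loss: a symmetric, CLIP-style InfoNCE loss over a batch of paired embeddings. This is an illustration under stated assumptions, not the authors' implementation. The function names, the temperature value of 0.07, and the use of cosine similarity are all assumptions that do not come from the paper.

```python
import math


def _normalize(vec):
    """Scale a vector to unit length (assumes the vector is non-zero)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def _diagonal_cross_entropy(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        row_max = max(row)  # subtract the max for numerical stability
        log_sum_exp = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def contrastive_loss(video_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss for a batch of paired embeddings.

    video_emb[i] and text_emb[i] are assumed to come from the same scene
    (a positive pair); every other combination in the batch is treated as
    a negative pair. The temperature is a hypothetical default.
    """
    videos = [_normalize(v) for v in video_emb]
    texts = [_normalize(t) for t in text_emb]
    # Cosine similarity between every video and every utterance, scaled.
    logits = [
        [sum(a * b for a, b in zip(v, t)) / temperature for t in texts]
        for v in videos
    ]
    logits_transposed = [list(col) for col in zip(*logits)]
    video_to_text = _diagonal_cross_entropy(logits)
    text_to_video = _diagonal_cross_entropy(logits_transposed)
    return 0.5 * (video_to_text + text_to_video)


# Toy batch: two scenes with 2-D embeddings.
aligned = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
mismatched = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(aligned < mismatched)
```

In this toy batch, the loss is close to zero when each video embedding matches its own utterance embedding and is much larger when the pairing is swapped. That is the behavior a contrastive objective is meant to reward.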

Paper Structure

This paper contains 14 sections, 5 equations, 2 figures, 7 tables.

Figures (2)

  • Figure 1: Model architecture, masking strategy, and spatio-temporal attention. (A) Video and utterance are embedded using their respective encoders. Video-utterance pairs from the same scene are pulled together, and video-utterance pairs from different scenes are pushed apart. (B) An example of blind spot masking, where black squares are masked patches, and the second row shows the original frames. (C) An example of tube masking, where purple blocks mark masked patches; the positions of the unmasked patches are chosen at random for each clip but held constant across all frames of that clip. (D) A visualization of different regions of blind spot masking, assuming videos are captured with a 109$\times$70 field-of-view camera. Example frames were taken from Something-Something V2 (Goyal et al., 2017).
  • Figure 2: Classification accuracy on the Labeled-S dataset. CLIP, DINO ViT-B, BlindSpotMAE, and VideoMAEv2 all refer to multimodal models. VideoMAEv2 and BlindSpotMAE-109$\times$70 are used as pretrained video encoders with a ViT-B backbone. Linear decoding refers to evaluating only the BlindSpotMAE video encoder with a linear head on the CVCL evaluation task, as shown in Table \ref{tab:linear_decode_eval}.
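
The tube masking described in Figure 1(C) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function name `tube_mask`, the 14 by 14 patch grid, the 8-frame clip, and the 0.9 mask ratio are hypothetical choices made for the example. The only property taken from the caption is that patch positions are sampled at random once per clip and then held constant across all frames of that clip.

```python
import random


def tube_mask(num_frames, grid_h, grid_w, mask_ratio, seed=None):
    """Return a per-frame list of booleans (True = masked patch).

    Masked positions are sampled once for the whole clip and reused for
    every frame, so each masked position forms a "tube" through time.
    """
    if not 0.0 <= mask_ratio <= 1.0:
        raise ValueError("mask_ratio must be in [0, 1]")
    num_patches = grid_h * grid_w
    num_masked = round(mask_ratio * num_patches)
    rng = random.Random(seed)
    masked_positions = set(rng.sample(range(num_patches), num_masked))
    frame_mask = [p in masked_positions for p in range(num_patches)]
    # Copy the same spatial mask for every frame in the clip.
    return [list(frame_mask) for _ in range(num_frames)]


# Hypothetical settings: 8 frames, 14x14 patch grid, 90% of patches masked.
mask = tube_mask(num_frames=8, grid_h=14, grid_w=14, mask_ratio=0.9, seed=0)
print(len(mask), sum(mask[0]))
```

Blind spot masking, as described in Figure 1(B) and (D), would differ in how the masked positions are selected: they would be derived from the camera's assumed field of view rather than sampled at random. That selection rule is not specified in this section, so it is not sketched here.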