Visually Grounded Speech Models have a Mutual Exclusivity Bias

Leanne Nortje; Dan Oneaţă; Yevgen Matusevych; Herman Kamper

TL;DR

The paper investigates whether visually grounded speech models exhibit the mutual exclusivity (ME) bias when learning from continuous speech paired with images. Using the Matt-Net architecture, it trains a model on familiar classes and then tests it by presenting a novel spoken word alongside a familiar and a novel object. Different audio and visual initialisations are used to simulate prior knowledge. The model computes a similarity score $S(a,v)$ between a speech input $a$ and an image $v$ through a multimodal attention mechanism, and is trained with a contrastive objective. Across initialisations and loss variants, the ME bias consistently emerges; it is strongest when both audio and vision priors are present, and it stabilises after roughly 60 training epochs. These findings show that child-like constraints such as ME can arise in visually grounded word-learning systems under naturalistic conditions, with implications for understanding representation geometry and the role of priors in multimodal language learning.
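To make the scoring and training idea above more concrete, here is a minimal sketch in plain Python. It is not Matt-Net's actual implementation: the max-over-dot-products form of `similarity`, the hinge form of `contrastive_hinge_loss`, and the `margin` parameter are all assumptions chosen for illustration, since this summary does not specify the attention mechanism or the exact loss.

```python
from typing import Sequence

Vector = Sequence[float]


def dot(x: Vector, y: Vector) -> float:
    """Plain dot product between two equal-length vectors."""
    return sum(a * b for a, b in zip(x, y))


def similarity(audio_frames: Sequence[Vector], image_regions: Sequence[Vector]) -> float:
    """Assumed stand-in for S(a, v): the best match between any audio
    frame embedding and any image region embedding."""
    return max(dot(f, r) for f in audio_frames for r in image_regions)


def contrastive_hinge_loss(s_pos: float, s_negs: Sequence[float], margin: float = 1.0) -> float:
    """Assumed hinge-style contrastive loss: penalise every mismatched
    pair whose score comes within `margin` of the matched pair's score."""
    return sum(max(0.0, margin - s_pos + s_neg) for s_neg in s_negs)


# Toy embeddings (hypothetical values, two dimensions each).
audio = [[1.0, 0.0], [0.0, 1.0]]
image = [[0.5, 0.0], [0.0, 2.0]]
s = similarity(audio, image)
loss = contrastive_hinge_loss(s, [0.5, 1.5])
```

In this toy setup, a matched speech-image pair is pushed to score higher than mismatched pairs by at least the margin, which is the general shape of a contrastive objective.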

Abstract

When children learn new words, they employ constraints such as the mutual exclusivity (ME) bias: a novel word is mapped to a novel object rather than a familiar one. This bias has been studied computationally, but only in models that use discrete word representations as input, ignoring the high variability of spoken words. We investigate the ME bias in the context of visually grounded speech models that learn from natural images and continuous speech audio. Concretely, we train a model on familiar words and test its ME bias by asking it to select between a novel and a familiar object when queried with a novel word. To simulate prior acoustic and visual knowledge, we experiment with several initialisation strategies using pretrained speech and vision networks. Our findings reveal the ME bias across the different initialisation approaches, with a stronger bias in models with more prior (in particular, visual) knowledge. Additional tests confirm the robustness of our results, even when different loss functions are considered.
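The test described in the abstract is a two-way forced choice, so it can be sketched as a simple accuracy computation. The following is an illustrative sketch only: the `me_accuracy` function, the trial layout, the tie-handling rule, and the toy `dot_score` scorer are hypothetical and stand in for whatever trained model produces the speech-image similarity.

```python
from typing import Callable, Sequence, Tuple

# One ME trial: (novel spoken query, familiar-class image, novel-class image).
Trial = Tuple[Sequence[float], Sequence[float], Sequence[float]]


def me_accuracy(
    trials: Sequence[Trial],
    score: Callable[[Sequence[float], Sequence[float]], float],
) -> float:
    """Fraction of trials in which the novel image scores higher than
    the familiar image for a novel query. Ties count as half a win."""
    if not trials:
        raise ValueError("need at least one trial")
    wins = 0.0
    for query, familiar_img, novel_img in trials:
        s_familiar = score(query, familiar_img)
        s_novel = score(query, novel_img)
        if s_novel > s_familiar:
            wins += 1.0
        elif s_novel == s_familiar:
            wins += 0.5
    return wins / len(trials)


def dot_score(a: Sequence[float], v: Sequence[float]) -> float:
    """Toy similarity: dot product of pre-computed embeddings."""
    return sum(x * y for x, y in zip(a, v))


# Hypothetical embeddings: the query is closer to the novel image in
# the first trial and closer to the familiar image in the second.
toy_trials = [
    ([1.0, 0.0], [0.0, 1.0], [1.0, 0.0]),
    ([0.0, 1.0], [0.0, 1.0], [1.0, 0.0]),
]
accuracy = me_accuracy(toy_trials, dot_score)
```

Because each trial offers two candidates, a learner with no preference would score about 0.5; an ME bias corresponds to an accuracy above that level.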

Paper Structure (18 sections, 3 equations, 7 figures, 6 tables)

Introduction
Related work
Mutual exclusivity in visually grounded speech models
Constructing a speech–image test for mutual exclusivity
A visually grounded speech model
Model
Different initialisation strategies as a proxy for prior knowledge
Mutual exclusivity results
Further analyses
Sanity checks
Why do we see a ME bias?
Finer-grained analysis
How specific are our findings to Matt-Net?
Loss function.
Visual network initialisation.
...and 3 more sections

Figures (7)

Figure 1: Top: A learner is familiarised with a set of objects during training. Middle: At test time, two images are given, one from a familiar class seen during training and the other from an unseen novel class. Bottom: If a learner has a ME bias, then when prompted with a novel spoken query, the novel object (guitar) would be selected.
Figure 2: Matt-Net (Nortje et al., 2023) consists of a vision network and an audio network. These are connected through a word-to-image attention mechanism. The model outputs a score $S$ indicating the similarity of the speech and image inputs.
Figure 3: Matt-Net's performance over training epochs. The cross indicates the highest overall ME familiar–novel score. The triangles show the scores at the point where the best familiar–familiar score occurs. Results are for the variant of Matt-Net with both CPC and AlexNet initialisations, and performance is averaged over five training runs.
Figure 4: A box plot of similarities for four types of audio–image comparisons with Matt-Net. The audio–image examples of a familiar class have higher similarity (A) than mismatched familiar instances (B). Novel class instances are in-between (C), but they aren't placed as close as the learned familiar classes (A). Novel instances (C) are still closer to each other than to familiar ones (D).
Figure 5: The same analysis as in Figure 4, but for Matt-Net before training. We can see how similarities are affected through training.
...and 2 more figures
