Reframing linguistic bootstrapping as joint inference using visually-grounded grammar induction models

Eva Portelance; Siva Reddy; Timothy J. O'Donnell

TL;DR

The paper tackles how children may acquire language through bootstrapping by proposing a unified, joint inference framework over syntax and semantics guided by visual grounding. It introduces a visually-grounded grammar induction model based on a compound PCFG (C-PCFG) with a sentence-wide latent variable $\mathbf{z}$ and a multimodal semantic objective, optimized jointly as $\mathcal{L}_{\text{joint}} = \alpha_1 \mathcal{L}_{\text{syntax}} + \alpha_2 \mathcal{L}_{\text{semantics}}$ with $\alpha_1=\alpha_2=1$. Using the Abstract Scenes dataset, the authors show that joint learning yields superior grammar induction, more realistic lexical-category mappings, and improved interpretation of novel verbs and semantic roles compared to semantics-first, syntax-first, or grounding-only baselines. The findings support a unified account of semantic and syntactic bootstrapping through mutual constraint across modalities, with implications for cognitive science and AI language learning under constrained data and computation. This work highlights joint, multimodal learning as a productive avenue for understanding and modeling language acquisition and generalization.
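The weighted combination in the TL;DR can be illustrated with a minimal sketch. The function name `joint_loss`, its parameter names, and the numeric loss values are hypothetical and chosen only for illustration; the sketch shows the arithmetic of $\mathcal{L}_{\text{joint}}$ with $\alpha_1=\alpha_2=1$, not the authors' training code.

```python
def joint_loss(syntax_loss, semantics_loss, alpha_syntax=1.0, alpha_semantics=1.0):
    """Weighted sum of the two objectives: alpha_1 * L_syntax + alpha_2 * L_semantics.

    The default weights mirror alpha_1 = alpha_2 = 1 from the summary above.
    """
    return alpha_syntax * syntax_loss + alpha_semantics * semantics_loss


# Hypothetical per-batch loss values, for illustration only.
syntax_loss = 2.5
semantics_loss = 1.5

# With equal unit weights, the joint loss is the plain sum of the two terms.
print(joint_loss(syntax_loss, semantics_loss))

# Setting one weight to zero isolates a single objective (an illustrative
# assumption, not a description of how the paper's baselines are built).
print(joint_loss(syntax_loss, semantics_loss, alpha_semantics=0.0))
```

With unit weights neither objective is prioritized, so gradients from both terms update the shared parameters at every step.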

Abstract

Semantic and syntactic bootstrapping posit that children use their prior knowledge of one linguistic domain, say syntactic relations, to help later acquire another, such as the meanings of new words. Empirical results supporting both theories may tempt us to believe that these are different learning strategies, where one may precede the other. Here, we argue that they are instead both contingent on a more general learning strategy for language acquisition: joint learning. Using a series of neural visually-grounded grammar induction models, we demonstrate that both syntactic and semantic bootstrapping effects are strongest when syntax and semantics are learnt simultaneously. Joint learning results in better grammar induction, realistic lexical category learning, and better interpretations of novel sentence and verb meanings. Joint learning makes language acquisition easier for learners by mutually constraining the hypothesis spaces for both syntax and semantics. Studying the dynamics of joint inference over many input sources and modalities represents an important new direction for language modeling and learning research in both cognitive sciences and AI, as it may help us explain how language can be acquired in more constrained learning settings.

Paper Structure (33 sections, 14 equations, 16 figures, 3 tables)


Introduction
Linguistic bootstrapping debates
Grammar induction models and linguistic bootstrapping
Our proposal
The dataset
Test-train splits
The joint-learning model
The syntactic objective
The semantic objective
Joint-learning model
Model ablations and baselines
Semantics-first model
Syntax-first model
Visual-labels model
Experiment 1: Semantic bootstrapping and joint learning
...and 18 more sections

Figures (16)

Figure 1: Example of the nonce verb learning experimental paradigm from Yuan et al. (2012), demonstrating indirect evidence for the syntactic bootstrapping hypothesis.
Figure 2: Example image-sentence pairs from the Abstract Scenes dataset (Zitnick and Parikh, 2013; Zitnick et al., 2013).
Figure 3: The joint model architecture
Figure 4: Mean Span F1 scores on test sentences by model during learning. Shading represents standard error across 5 runs. Dashed line represents point in time where semantics-first and syntax-first models switch to joint-learning loss function.
Figure 5: Examples of (a) induced trees from the joint-learning model and (b) gold parses from the Berkeley Neural Parser. Red boxes highlight discrepancies between predicted and gold trees.
...and 11 more figures
