Reframing linguistic bootstrapping as joint inference using visually-grounded grammar induction models

Eva Portelance, Siva Reddy, Timothy J. O'Donnell

TL;DR

The paper tackles how children may acquire language through bootstrapping by proposing a unified, joint inference framework over syntax and semantics guided by visual grounding. It introduces a visually-grounded grammar induction model based on a compound PCFG (C-PCFG) with a sentence-wide latent variable $\mathbf{z}$ and a multimodal semantic objective, optimized jointly as $\mathcal{L}_{\text{joint}} = \alpha_1 \mathcal{L}_{\text{syntax}} + \alpha_2 \mathcal{L}_{\text{semantics}}$ with $\alpha_1=\alpha_2=1$. Using the Abstract Scenes dataset, the authors show that joint learning yields superior grammar induction, more realistic lexical-category mappings, and improved interpretation of novel verbs and semantic roles compared to semantics-first, syntax-first, or grounding-only baselines. The findings support a unified account of semantic and syntactic bootstrapping through mutual constraint across modalities, with implications for cognitive science and AI language learning under constrained data and computation. This work highlights joint, multimodal learning as a productive avenue for understanding and modeling language acquisition and generalization.
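The weighted objective in the TL;DR can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names, the `regime` strings, and the `switch_step` parameter are all hypothetical. The staged schedule also assumes that the syntax-first and semantics-first baselines train on a single loss term before switching to the joint loss, which this summary does not spell out.

```python
def joint_loss(syntax_loss, semantics_loss, alpha_syntax=1.0, alpha_semantics=1.0):
    """Weighted sum of the two objectives; both weights default to 1, as in the TL;DR."""
    return alpha_syntax * syntax_loss + alpha_semantics * semantics_loss


def staged_weights(regime, step, switch_step):
    """Hypothetical schedule returning (alpha_syntax, alpha_semantics) for a training step.

    Assumption: before `switch_step`, a "syntax-first" learner uses only the syntax
    loss and a "semantics-first" learner uses only the semantics loss; from
    `switch_step` onward every regime uses the joint loss.
    """
    if regime == "joint" or step >= switch_step:
        return 1.0, 1.0
    if regime == "syntax-first":
        return 1.0, 0.0
    if regime == "semantics-first":
        return 0.0, 1.0
    raise ValueError(f"unknown regime: {regime}")


# Usage with made-up loss values for a single training step.
alpha_syntax, alpha_semantics = staged_weights("semantics-first", step=0, switch_step=10)
loss = joint_loss(2.0, 3.0, alpha_syntax, alpha_semantics)
```

With both weights fixed at 1, the joint learner receives gradient from both objectives at every step, which is the setting the summary reports as most effective.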

Abstract

Semantic and syntactic bootstrapping posit that children use their prior knowledge of one linguistic domain, say syntactic relations, to help later acquire another, such as the meanings of new words. Empirical results supporting both theories may tempt us to believe that these are different learning strategies, where one may precede the other. Here, we argue that they are instead both contingent on a more general learning strategy for language acquisition: joint learning. Using a series of neural visually-grounded grammar induction models, we demonstrate that both syntactic and semantic bootstrapping effects are strongest when syntax and semantics are learnt simultaneously. Joint learning results in better grammar induction, realistic lexical category learning, and better interpretations of novel sentence and verb meanings. Joint learning makes language acquisition easier for learners by mutually constraining the hypothesis spaces for both syntax and semantics. Studying the dynamics of joint inference over many input sources and modalities represents an important new direction for language modeling and learning research in both cognitive sciences and AI, as it may help us explain how language can be acquired in more constrained learning settings.

Paper Structure (33 sections, 14 equations, 16 figures, 3 tables)

Figures (16)

  • Figure 1: Example of the nonce-verb learning experimental paradigm from yuan2012counting, demonstrating indirect evidence for the syntactic bootstrapping hypothesis.
  • Figure 2: Example image-sentence pairs from the Abstract Scenes dataset (zitnick2013bringing, zitnick2013learning).
  • Figure 3: The joint model architecture
  • Figure 4: Mean span F1 scores on test sentences by model during learning. Shading represents standard error across 5 runs. The dashed line marks the point at which the semantics-first and syntax-first models switch to the joint-learning loss function.
  • Figure 5: Examples of (a) induced trees from the joint-learning model and (b) gold parses from the Berkeley Neural Parser. Red boxes highlight discrepancies between predicted and gold trees.
  • ...and 11 more figures