Reframing linguistic bootstrapping as joint inference using visually-grounded grammar induction models
Eva Portelance, Siva Reddy, Timothy J. O'Donnell
TL;DR
The paper asks how children may acquire language through bootstrapping and proposes a unified joint inference framework over syntax and semantics, guided by visual grounding. It introduces a visually-grounded grammar induction model based on a compound PCFG (C-PCFG) with a sentence-wide latent variable $\mathbf{z}$ and a multimodal semantic objective, optimized jointly as $\mathcal{L}_{\text{joint}} = \alpha_1 \mathcal{L}_{\text{syntax}} + \alpha_2 \mathcal{L}_{\text{semantics}}$ with $\alpha_1=\alpha_2=1$. Using the Abstract Scenes dataset, the authors show that joint learning yields better grammar induction, more realistic lexical-category mappings, and improved interpretation of novel verbs and semantic roles than semantics-first, syntax-first, or grounding-only baselines. The findings support a unified account of semantic and syntactic bootstrapping through mutual constraint across modalities, with implications for cognitive science and for AI language learning under constrained data and computation. This work highlights joint, multimodal learning as a productive avenue for understanding and modeling language acquisition and generalization.
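The weighted objective above can be illustrated with a minimal sketch. It assumes the syntax and semantics losses have already been computed as scalars for a batch. The function and argument names are hypothetical and are not taken from the authors' code.

```python
def joint_loss(syntax_loss: float,
               semantics_loss: float,
               alpha_1: float = 1.0,
               alpha_2: float = 1.0) -> float:
    """Weighted sum of the two loss terms (hypothetical helper).

    The defaults mirror the setting alpha_1 = alpha_2 = 1 stated in the
    summary, in which case the joint loss is the plain sum of both terms.
    """
    return alpha_1 * syntax_loss + alpha_2 * semantics_loss


# Illustrative values only; these are not results from the paper.
total = joint_loss(syntax_loss=2.5, semantics_loss=1.5)
print(total)
```

With both weights equal to 1, neither term is favored, so gradients from the syntactic and semantic objectives contribute on equal footing to the shared parameters.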
Abstract
Semantic and syntactic bootstrapping posit that children use their prior knowledge of one linguistic domain, say syntactic relations, to help later acquire another, such as the meanings of new words. Empirical results supporting both theories may tempt us to believe that these are different learning strategies, where one may precede the other. Here, we argue that they are instead both contingent on a more general learning strategy for language acquisition: joint learning. Using a series of neural visually-grounded grammar induction models, we demonstrate that both syntactic and semantic bootstrapping effects are strongest when syntax and semantics are learnt simultaneously. Joint learning results in better grammar induction, realistic lexical category learning, and better interpretations of novel sentence and verb meanings. Joint learning makes language acquisition easier for learners by mutually constraining the hypothesis spaces for both syntax and semantics. Studying the dynamics of joint inference over many input sources and modalities represents an important new direction for language modeling and learning research in both cognitive science and AI, as it may help us explain how language can be acquired in more constrained learning settings.
