World-to-Words: Grounded Open Vocabulary Acquisition through Fast Mapping in Vision-Language Models

Ziqiao Ma, Jiayi Pan, Joyce Chai

TL;DR

This work addresses how to ground language in vision and enable fast, open-world word learning. It introduces Grounded Open Vocabulary Acquisition (GOVA) and object-oriented BERT (OctoBERT), a visually grounded language model trained with masked language modeling, object localization, and word-region grounding objectives to align linguistic and perceptual representations. Empirical results show that grounded pre-training yields data-efficient learning for both seen and unseen words, including word-agnostic grounding of unseen terms and rapid few-shot acquisition, and an accompanying analysis links model behavior to linguistic, perceptual, and psycho-linguistic predictors. Together, these findings indicate that grounding augments word learning in vision-language models and point to scalable pathways for open-world grounded language agents, while highlighting cognitive and ethical considerations for future work.

Abstract

The ability to connect language units to their referents in the physical world, referred to as grounding, is crucial to learning and understanding grounded meanings of words. While humans demonstrate fast mapping in new word learning, it remains unclear whether modern vision-language models can truly represent language with their grounded meanings and how grounding may further bootstrap new word learning. To this end, we introduce Grounded Open Vocabulary Acquisition (GOVA) to examine grounding and bootstrapping in open-world language learning. As an initial attempt, we propose object-oriented BERT (OctoBERT), a novel visually grounded language model pre-trained on image-text pairs with grounding as an objective. Through extensive experiments and analysis, we demonstrate that OctoBERT is a more coherent and faster grounded word learner, and that the grounding ability acquired during pre-training helps the model learn unseen words more rapidly and robustly. Our code is available at https://github.com/sled-group/world-to-words

Paper Structure (55 sections, 7 equations, 9 figures, 7 tables)

Figures (9)

  • Figure 1: Even when the term "incinerator" (highlighted yellow) is new to human learners, they can still locate the most likely referent (indicated by the yellow bounding box) in the perceived world by grounding.
  • Figure 2: An instance of the word grounding task. Models are tasked to predict the missing word "boat" and to localize the corresponding smaller yellow boat in the image coherently.
  • Figure 3: An illustration of the few-shot new word learning paradigm. The model first pre-trains on a grounding dataset with a set of base words ($\mathcal{V}_{\textrm{seen}}$), and then attempts to acquire a set of unseen words ($\mathcal{V}_{\textrm{unseen}}$) from a small number of raw text-image pairs. Tests are performed after each training session.
  • Figure 4: An overview of OctoBERT, a visually grounded language model pre-trained with three objectives: masked language modeling (MLM), object localization (OL), and grounding through word-region alignment (WRA).
  • Figure 5: Although the word "elephant" is unseen to OctoBERT, the model is still able to localize the object in the image referred to by the MASK.
  • ...and 4 more figures
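
The few-shot paradigm described in the Figure 3 caption (train on a small number of text-image pairs, then test after each training session) can be illustrated with a minimal Python sketch. This is not the authors' code: `few_shot_schedule`, `toy_train_step`, `toy_evaluate`, the image ids, and the captions are all hypothetical, and the toy "learner" is just a set of known words standing in for a pre-trained model.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical representation of a raw text-image pair: (image id, caption).
Pair = Tuple[str, str]


def few_shot_schedule(
    train_step: Callable[[Sequence[Pair]], None],
    evaluate: Callable[[], float],
    sessions: Sequence[Sequence[Pair]],
) -> List[float]:
    """Run the training sessions in order and test after each one."""
    scores = []
    for pairs in sessions:
        train_step(pairs)          # one short training session
        scores.append(evaluate())  # test performed after the session
    return scores


# Toy stand-in for a model pre-trained on the base words (V_seen).
seen = {"boat", "dog"}
unseen = {"elephant", "incinerator"}  # words to acquire (V_unseen)
known = set(seen)


def toy_train_step(pairs: Sequence[Pair]) -> None:
    # Assumption: the toy learner "acquires" every word in each caption.
    for _image_id, caption in pairs:
        known.update(caption.split())


def toy_evaluate() -> float:
    # Fraction of unseen words the toy learner now knows.
    return len(known & unseen) / len(unseen)


sessions = [
    [("img_1", "an elephant near a boat")],
    [("img_2", "an incinerator behind a dog")],
]
scores = few_shot_schedule(toy_train_step, toy_evaluate, sessions)
```

In a real setting, `train_step` would update a model on the few-shot pairs and `evaluate` would run a held-out test on the unseen words; the schedule itself stays the same.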