Visual Grounding Helps Learn Word Meanings in Low-Data Regimes

Chengxu Zhuang, Evelina Fedorenko, Jacob Andreas

TL;DR

This study probes whether visual grounding improves word learning in neural LMs by contrasting grounded architectures (CLIP, GIT, Flamingo) with language-only baselines across dataset scales. Using a battery of word-learning benchmarks and brain-alignment measures, the authors find limited, data-size-dependent benefits from visual input, largely restricted to concrete-word semantics in low-data regimes and sometimes canceled out when rich textual distributional signals are available. Grounded models tend to learn qualitatively different representations, yet current multimodal approaches struggle to integrate vision and language to produce human-like word representations from human-scale data. The work underscores the need for new learning mechanisms and richer, dynamic visual signals to realize more robust visually grounded language acquisition in machines.

Abstract

Modern neural language models (LMs) are powerful tools for modeling human sentence production and comprehension, and their internal representations are remarkably well-aligned with representations of language in the human brain. But to achieve these results, LMs must be trained in distinctly un-human-like ways -- requiring orders of magnitude more language data than children receive during development, and without perceptual or social context. Do models trained more naturalistically -- with grounded supervision -- exhibit more humanlike language learning? We investigate this question in the context of word learning, a key sub-task in language acquisition. We train a diverse set of LM architectures, with and without auxiliary visual supervision, on datasets of varying scales. We then evaluate these models' learning of syntactic categories, lexical relations, semantic features, word similarity, and alignment with human neural representations. We find that visual supervision can indeed improve the efficiency of word learning. However, these improvements are limited: they are present almost exclusively in the low-data regime, and sometimes canceled out by the inclusion of rich distributional signals from text. The information conveyed by text and images is not redundant -- models mainly driven by visual information yield representations that are qualitatively different from those of models mainly driven by word co-occurrences. However, our results suggest that current multimodal modeling approaches fail to effectively leverage visual information to build human-like word representations from human-scale data.

Paper Structure (20 sections, 15 figures)

Figures (15)

  • Figure 1: In learning word meanings, visual information provides some help in the low-data regime but only has limited additional utility relative to cross-word distributional information. A. Pretraining schema for Language-Only, Visual + Word, and Visual + Language models. From left to right: an example image-caption pair; Language-Only models are trained on a next-token prediction objective; Visual + Language (GIT) models include the image features in the context to predict the next token; Visual + Word (CLIP) models optimize their text encoder to generate features that are similar to the corresponding image feature and dissimilar to other image features. B. Results on word-learning benchmarks for the Language-Only (●), Visual + Word (CLIP) (▼), Visual + Language (CLIP) (◆), Visual + Word (GIT) (■), Visual + Language (GIT) (✖), and Word-Only Baseline (✚) models. The word-relatedness benchmark computes correlations between hidden representations of two words and compares these correlations to human ratings of how related these words are. The other three benchmarks evaluate the accuracy of predicting the corresponding features of words (or word pairs) from the hidden representations of words (or the differences between word representations). The x-axis is on a log scale. The width of lines in these plots represents the standard error of the mean across four models initialized from different random seeds.
  • Figure 2: Visual + Word and Language-Only models produce distinct representations. A. Scatter plots for the word-relatedness benchmark. Each dot represents one pair of words. Its y-value represents the relative rank after sorting the word pairs using the difference between the human relatedness judgment and the correlation of model representations. A higher y-value means more human-like. Linear regression lines are plotted on the figure with the 95% confidence interval. B. The results of word-relatedness benchmarks on another dataset containing only verbs (SimVerb-3500, left) and the subset of color-word pairs in the previously used dataset. The marker-model mapping is the same as that in Figure 1.
  • Figure 3: Adding more context to Visual + Word models, or changing the model to Flamingo, offers little benefit. The top row shows some of the small-context labels generated from the example caption in Fig. 1. Two models with different random seeds are trained in each condition. The results are from the Language-Only (●), Visual + Language (Flamingo) (★), Visual + Context (CLIP) (♦), Visual + Context (GIT) (■), Visual + Context (Flamingo) (▶), Visual + Word (Flamingo) (◆), and Context-Only (✚) models.
  • Figure 4: Grounded models mostly underperform ungrounded models on context-based word-understanding and brain-response prediction benchmarks. A. Performance of grounded and ungrounded models trained with small contexts (three consecutive words) or full captions as image labels on the context-based word-understanding benchmark. We present the results from the Language-Only (●), Visual + Language (GIT) (✖), Visual + Language (Flamingo) (★), Visual + Context (CLIP) (♦), Visual + Context (GIT) (■), Visual + Context (Flamingo) (▶), and Context-Only (✚) models. B. Brain-response fitting results for Language-Only and Visual + Language models. Four models with different random seeds are trained in each condition.
  • Figure 5: Image representations have a small influence on word learning. We show the results for the DINO-ViT (▼), DINOv2-ViT (◆), DINO-Res50 (■), DINO-ViT-Trainable (▶), MAE-ViT (★), Language-Only (●), Random-ViT (✖), and Random-ViT-Trainable (✚) models. The Visual + Word (CLIP) training regime is used for these experiments. The DINO-ViT model represents the Visual + Word (CLIP) models in previous figures. Two models with different seeds are trained in each condition.
  • ...and 10 more figures
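
The Figure 1 caption describes the word-relatedness benchmark as computing correlations between the hidden representations of two words and comparing them to human relatedness ratings. The following is a minimal sketch of that idea, not the authors' implementation. It assumes Pearson correlation between representation vectors and a Spearman-style rank correlation for the comparison with human ratings; the caption does not specify either choice. The function names, toy embeddings, and ratings are all hypothetical.

```python
import numpy as np


def representation_correlation(u, v):
    """Pearson correlation between two word representation vectors (assumed metric)."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(np.corrcoef(u, v)[0, 1])


def ranks(values):
    """Rank positions 0..n-1 of the values (ties are not averaged in this sketch)."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values)
    result = np.empty(len(values), dtype=float)
    result[order] = np.arange(len(values))
    return result


def relatedness_score(embeddings, pairs, human_ratings):
    """Rank correlation between model-derived and human relatedness scores."""
    model_scores = [
        representation_correlation(embeddings[a], embeddings[b]) for a, b in pairs
    ]
    return float(np.corrcoef(ranks(model_scores), ranks(human_ratings))[0, 1])


# Hypothetical toy data: four-dimensional "hidden representations" and made-up ratings.
embeddings = {
    "cat": np.array([1.0, 2.0, 3.0, 4.0]),
    "dog": np.array([1.1, 2.1, 2.9, 4.2]),
    "car": np.array([4.0, 1.0, 3.5, 0.5]),
    "idea": np.array([3.0, -1.0, 0.5, 2.0]),
}
pairs = [("cat", "dog"), ("cat", "car"), ("car", "idea")]
human_ratings = [9.0, 2.0, 1.0]

score = relatedness_score(embeddings, pairs, human_ratings)
print(f"word-relatedness score: {score:.3f}")
```

A higher score means the model orders word pairs more like the human raters do. With real data, one representation per word would be extracted from a trained model and the ratings would come from a human-annotated relatedness dataset.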