
Lexicon-Level Contrastive Visual-Grounding Improves Language Modeling

Chengxu Zhuang, Evelina Fedorenko, Jacob Andreas

TL;DR

Across multiple word-learning and sentence-understanding benchmarks, LexiContrastive Grounding not only outperforms standard language-only models in learning efficiency, but also improves upon vision-and-language learning procedures including CLIP, GIT, Flamingo, and Vokenization.

Abstract

Today's most accurate language models are trained on orders of magnitude more language data than human language learners receive - but with no supervision from other sensory modalities that play a crucial role in human learning. Can we make LMs' representations and predictions more accurate (and more human-like) with more ecologically plausible supervision? This paper describes LexiContrastive Grounding (LCG), a grounded language learning procedure that leverages visual supervision to improve textual representations. LexiContrastive Grounding combines a next token prediction strategy with a contrastive visual grounding objective, focusing on early-layer representations that encode lexical information. Across multiple word-learning and sentence-understanding benchmarks, LexiContrastive Grounding not only outperforms standard language-only models in learning efficiency, but also improves upon vision-and-language learning procedures including CLIP, GIT, Flamingo, and Vokenization. Moreover, LexiContrastive Grounding improves perplexity by around 5% on multiple language modeling tasks. This work underscores the potential of incorporating visual grounding into language models, aligning more closely with the multimodal nature of human language acquisition.

Paper Structure (19 sections, 3 equations, 13 figures)


Figures (13)

  • Figure 1: LexiContrastive Grounding models leverage visual information to facilitate word learning when they are trained on image-caption datasets. A. Pretraining schema for the LexiContrastive Grounding models. The images are sent to a frozen visual encoder, pretrained using unsupervised learning algorithms, to generate image features. These image features and the hidden representations of the first layer after the token-embedding layer are used to compute a vision-language contrastive loss. This loss is added to the next-token prediction loss to form the final loss. B. Results from the grounded-only learning scenario on word-learning benchmarks for LexiContrastive Grounding (●), Language-Only (■), CLIP (◆), GIT (▼), and Flamingo (✖). The X-axis is plotted on a log scale. Each point represents the average performance of four models initialized from different random seeds, and the line width represents the S.E.M. across these four models. C. Results from the mixed-learning scenario on the language modeling and word-learning benchmarks. We also add LexiVoken Grounding (✚) and Vokenization (♦) models. The ungrounded dataset is Smashwords-5M. Different dots of the same color represent models with different random initialization seeds.
  • Figure 2: LexiContrastive Grounding models learn concrete words better than Language-Only models. A. Scatter plot for an analysis on the word-relatedness benchmark for the LCG model trained with 2.1M image-caption pairs. Each point on the plot corresponds to a pair of words, with its Y-value indicating the relative rank obtained by sorting the word pairs based on the difference between human and model judgments. A greater Y-value signifies a closer resemblance to human judgment. Linear regression lines are also shown, along with their respective $95\%$ confidence intervals. B. The results on SimVerb-3500, a word-relatedness benchmark that evaluates models only on verbs. The marker-to-model map is the same as that in Fig. 1. C. Distributions of the per-word prediction performance difference between LexiContrastive Grounding and Language-Only models, grouped by word concreteness. The prediction performance is the negative log-likelihood of the corresponding word averaged across all appearances in the test dataset. The LexiContrastive Grounding and Language-Only models are taken from the “same” condition in Fig. 1C. A positive difference means that the LexiContrastive Grounding model is better than the Language-Only model.
  • Figure 3: Ablation studies support the algorithm design of LexiContrastive Grounding. Less Grounding (★) model changes $\lambda_{c}$ to 0.1. More Grounding (✖) changes it to 1. No-Narrow-Att (▼) model has the typical attention layer as the first layer. Mid-Grounding ($\blacktriangleright$) model computes the grounding loss from the third layer. Sentence CLIP (◆) model computes the sentence-level CLIP loss from the top layer as the grounding loss.
  • Figure 4: LexiContrastive Grounding models yield stronger language learning performance than other models when they are co-trained on image-caption and language-only datasets. A. Performance on the language modeling and the word learning benchmarks for LexiContrastive Grounding (●), LexiContrastive Grounding using Vokens (✚), Language-Only (■), Vokenization (♦), GIT (▼), and Flamingo (✖). The models are trained on a mix of image captions and a language-only dataset containing 5M tokens sampled from CHILDES. The language modeling benchmark evaluates the perplexity of the models on the held-out set of the corresponding language-only datasets. Different dots of the same color represent models with different random initialization seeds. B. The language-only dataset is a subset of Smashwords containing 15M tokens.
  • Figure 5: Perplexity on the Smashwords validation set for models trained with different $\lambda_{u}$ in the training setup with 5M tokens from Smashwords and 2.5M tokens in coupled image-caption pairs. For each algorithm and each $\lambda_{u}$, two models are trained from different initialization seeds. LCG represents LexiContrastive Grounding, and LVG represents LexiVoken Grounding.
  • ...and 8 more figures
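
The pretraining schema in the Figure 1A caption (a contrastive loss between frozen image features and first-layer text representations, added to the next-token prediction loss) can be illustrated with a small NumPy sketch. This is a sketch under stated assumptions, not the authors' implementation: the symmetric InfoNCE form, the temperature value, the one-feature-vector-per-caption shapes, and all function names are hypothetical. The weight `lambda_c` mirrors the $\lambda_{c}$ varied in the Figure 3 ablations; its default value is not given in this excerpt, so it is left as a required argument.

```python
import numpy as np

def log_softmax(x, axis=-1):
    # Numerically stable log-softmax.
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def next_token_loss(logits, targets):
    # Mean cross-entropy; logits: (T, V), targets: (T,) integer token ids.
    logp = log_softmax(logits, axis=-1)
    return -logp[np.arange(len(targets)), targets].mean()

def contrastive_loss(text_feats, image_feats, temperature=0.07):
    # Assumed symmetric InfoNCE: row i of text_feats should match row i of
    # image_feats. Both are (B, D); image_feats stands in for the output of
    # the frozen visual encoder, text_feats for first-layer text features.
    t = text_feats / np.linalg.norm(text_feats, axis=1, keepdims=True)
    v = image_feats / np.linalg.norm(image_feats, axis=1, keepdims=True)
    sims = t @ v.T / temperature
    idx = np.arange(len(t))
    text_to_image = -log_softmax(sims, axis=1)[idx, idx].mean()
    image_to_text = -log_softmax(sims, axis=0)[idx, idx].mean()
    return 0.5 * (text_to_image + image_to_text)

def combined_loss(logits, targets, text_feats, image_feats, lambda_c):
    # Final loss: next-token prediction plus weighted grounding term.
    lm = next_token_loss(logits, targets)
    grounding = contrastive_loss(text_feats, image_feats)
    return lm + lambda_c * grounding
```

In this sketch, setting `lambda_c` to 0.1 or 1 would correspond to the Less Grounding and More Grounding conditions named in Figure 3, assuming the weight enters the loss additively as shown.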