Vokenization: Improving Language Understanding with Contextualized, Visual-Grounded Supervision

Hao Tan, Mohit Bansal

TL;DR

This work addresses the gap between visually grounded language data and large-scale text corpora by introducing vokenization, a pipeline that contextualizes tokens with related images (vokens) using a learned contextual token-image matcher. A vokenizer trained on image-captioning data annotates large text corpora, enabling a visually supervised pre-training objective (voken-classification) alongside standard masked language modeling. Empirical results show consistent improvements across GLUE, SQuAD, and SWAG, with successful transfer to RoBERTa, validating the approach and its potential to enrich language understanding with grounded visual information. The paper also introduces revokenization to facilitate cross-framework transfer and provides analyses on when grounding helps, how token-level grounding compares to sentence-level methods, and visualizations of learned vokens. Overall, the work demonstrates that contextually grounded visual supervision can augment pure-language understanding and offers a scalable path to multimodal-informed language models.

Abstract

Humans learn language by listening, speaking, writing, reading, and also via interaction with the multimodal real world. Existing language pre-training frameworks show the effectiveness of text-only self-supervision, while in this paper we explore the idea of a visually-supervised language model. We find that the main obstacle to this exploration is the large divergence in magnitude and distribution between visually-grounded language datasets and pure-language corpora. Therefore, we develop a technique named "vokenization" that extrapolates multimodal alignments to language-only data by contextually mapping language tokens to their related images (which we call "vokens"). The "vokenizer" is trained on relatively small image captioning datasets, and we then apply it to generate vokens for large language corpora. Trained with these contextually generated vokens, our visually-supervised language models show consistent improvements over self-supervised alternatives on multiple pure-language tasks such as GLUE, SQuAD, and SWAG. Code and pre-trained models are publicly available at https://github.com/airsplay/vokenization
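The token-to-image mapping described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that contextual token embeddings and image embeddings already live in a shared space, and that relevance is scored by cosine similarity followed by a nearest-neighbor lookup. The function name `vokenize` and the toy random embeddings are hypothetical.

```python
import numpy as np

def vokenize(token_embs: np.ndarray, image_embs: np.ndarray) -> np.ndarray:
    """Return, for each token, the index of its best-matching image (its voken).

    Assumption: both inputs are already embedded in a shared space, e.g. by a
    matcher trained on image-captioning data. Shapes are (num_tokens, dim) and
    (num_images, dim).
    """
    # L2-normalize so that the inner product equals cosine similarity.
    t = token_embs / np.linalg.norm(token_embs, axis=1, keepdims=True)
    v = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    scores = t @ v.T              # (num_tokens, num_images) relevance scores
    return scores.argmax(axis=1)  # nearest-neighbor image per token

# Toy data: 5 images, and 3 contextual token embeddings that are noisy
# copies of images 3, 0, and 4 respectively.
rng = np.random.default_rng(0)
image_embs = rng.normal(size=(5, 8))
token_embs = image_embs[[3, 0, 4]] + 0.01 * rng.normal(size=(3, 8))
voken_ids = vokenize(token_embs, image_embs)
```

Because the token embeddings are contextual, the same word can receive different vokens in different sentences. For a large image set, the exhaustive `argmax` would typically be replaced by an approximate nearest-neighbor index.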

Paper Structure

This paper contains 39 sections, 17 equations, 4 figures, 6 tables.

Figures (4)

  • Figure 1: We visually supervise the language model with token-related images. We call these images vokens (visualized tokens) and develop a vokenization process to contextually generate them.
  • Figure 2: Illustration of a BERT transformer trained as a visually-supervised language model with two objectives: masked language modeling (on the left) and voken classification (on the right). The first objective (used in the original BERT pre-training) predicts the masked tokens as self-supervision, while the second objective predicts the corresponding vokens (contextually generated by our vokenization process) as external visual supervision. Since the inputs are the same, we optimize the two objectives simultaneously and share the model weights.
  • Figure 3: Implementation of our vokenization process. For the tokens in language corpora, we contextually retrieve images (with nearest-neighbor search) from the image set as vokens. These generated vokens are then used as visual supervision for the language model.
  • Figure 4: Visualization of model-generated vokens. Example 1 takes the leading sentence of this paper, while Example 2 takes a poem by Yeats.
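
The two-objective setup described for Figure 2 can be sketched as follows. This is a hedged illustration in NumPy rather than the paper's training code: the equal weighting of the two losses, the choice to score vokens at every position, and all names (`joint_loss`, `w_mlm`, `w_voken`) are assumptions made for the example.

```python
import numpy as np

def cross_entropy(logits: np.ndarray, targets: np.ndarray) -> float:
    """Mean cross-entropy for logits of shape (n, classes) and integer targets (n,)."""
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return float(-log_probs[np.arange(len(targets)), targets].mean())

def joint_loss(hidden, w_mlm, w_voken, token_ids, voken_ids, masked):
    """Sum of a masked-LM loss and a voken-classification loss.

    hidden:  (seq_len, dim) outputs of the shared language model
    w_mlm:   (dim, vocab_size) masked-LM head
    w_voken: (dim, num_vokens) voken-classification head
    masked:  boolean (seq_len,) marking positions that were masked
    Assumptions: equal loss weights; vokens are predicted at every position.
    """
    mlm = cross_entropy(hidden[masked] @ w_mlm, token_ids[masked])
    vok = cross_entropy(hidden @ w_voken, voken_ids)
    return mlm + vok, mlm, vok

# Toy example: 6 tokens, hidden size 4, vocabulary of 10 tokens, 7 vokens.
rng = np.random.default_rng(0)
hidden = rng.normal(size=(6, 4))
token_ids = rng.integers(0, 10, size=6)
voken_ids = rng.integers(0, 7, size=6)
masked = np.array([False, True, False, False, True, False])
total, mlm, vok = joint_loss(
    hidden, rng.normal(size=(4, 10)), rng.normal(size=(4, 7)),
    token_ids, voken_ids, masked,
)
```

Both heads read the same `hidden` states, which mirrors the caption's point that the inputs are shared and the two objectives are optimized together.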