Vokenization: Improving Language Understanding with Contextualized, Visual-Grounded Supervision
Hao Tan, Mohit Bansal
TL;DR
This work addresses the gap in size and distribution between visually grounded language datasets and large text-only corpora. It introduces vokenization, a pipeline that maps each token, in its sentence context, to a related image (a "voken") using a learned contextual token-image matching model. A vokenizer trained on image-captioning data annotates large text corpora, which enables a visually supervised pre-training objective (voken classification) alongside standard masked language modeling. Experiments show consistent improvements on GLUE, SQuAD, and SWAG, and the approach also transfers to RoBERTa. The paper further introduces revokenization to facilitate cross-framework transfer, and analyzes when grounding helps, how token-level grounding compares with sentence-level methods, and what the learned vokens look like. Overall, the results suggest that contextually grounded visual supervision can complement text-only pre-training and offers a scalable path to multimodal-informed language models.
Abstract
Humans learn language by listening, speaking, writing, reading, and also via interaction with the multimodal real world. Existing language pre-training frameworks show the effectiveness of text-only self-supervision, while in this paper we explore the idea of a visually-supervised language model. We find that the main obstacle to this exploration is the large divergence in magnitude and distribution between visually-grounded language datasets and pure-language corpora. Therefore, we develop a technique named "vokenization" that extrapolates multimodal alignments to language-only data by contextually mapping language tokens to their related images (which we call "vokens"). The "vokenizer" is trained on relatively small image captioning datasets, and we then apply it to generate vokens for large language corpora. Trained with these contextually generated vokens, our visually-supervised language models show consistent improvements over self-supervised alternatives on multiple pure-language tasks such as GLUE, SQuAD, and SWAG. Code and pre-trained models are publicly available at https://github.com/airsplay/vokenization
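To make the two steps in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It assumes that the token-image matcher scores a contextual token vector against an image vector with an inner product, that the voken for a token is the highest-scoring image in a fixed image set, and that voken classification is a per-token cross-entropy over that image set. The function names `vokenize` and `voken_classification_loss` and the toy vectors are hypothetical and are not taken from the released code.

```python
import math

def vokenize(token_vecs, image_vecs):
    """Return, for each contextual token vector, the index of the
    best-matching image (its "voken").
    Assumption: relevance is scored by an inner product."""
    vokens = []
    for t in token_vecs:
        scores = [sum(a * b for a, b in zip(t, img)) for img in image_vecs]
        vokens.append(max(range(len(scores)), key=scores.__getitem__))
    return vokens

def voken_classification_loss(logits, vokens):
    """Mean cross-entropy between per-token logits over the image set
    and the voken indices assigned by the vokenizer."""
    total = 0.0
    for row, target in zip(logits, vokens):
        m = max(row)  # subtract the max for a numerically stable log-sum-exp
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[target]
    return total / len(vokens)

# Toy data (hypothetical): three contextual token vectors, three image vectors.
token_vecs = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
image_vecs = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]
vokens = vokenize(token_vecs, image_vecs)

# Uniform logits stand in for an untrained language model's voken head.
uniform_logits = [[0.0, 0.0, 0.0] for _ in vokens]
loss = voken_classification_loss(uniform_logits, vokens)
```

In pre-training, a loss of this kind would be combined with the masked language modeling loss; how the two are combined is not specified in this section, so it is left out of the sketch.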
