Masked Vision and Language Modeling for Multi-modal Representation Learning
Gukyeong Kwon, Zhaowei Cai, Avinash Ravichandran, Erhan Bas, Rahul Bhotika, Stefano Soatto
TL;DR
This work addresses vision-language representation learning by introducing MaskVLM, a joint masked modeling approach that reconstructs the masked signal of one modality from the other and learns cross-modal alignment via image-text contrastive (ITC) and image-text matching (ITM) losses. It frames learning as modeling both p(T|I) and p(I|T) in an end-to-end transformer framework, backed by a probabilistic interpretation using variation of information. Empirically, MaskVLM achieves state-of-the-art results on image-text retrieval and competitive performance on visual question answering (VQA), NLVR2, and visual entailment (VE) with ~4M pre-training data, while showing strong data efficiency in limited-data regimes. The approach does not rely on frozen detectors or tokenizers, which enables end-to-end V+L interaction and yields robust cross-modal representations for diverse tasks.
Abstract
In this paper, we study how to use masked signal modeling in vision and language (V+L) representation learning. Instead of developing masked language modeling (MLM) and masked image modeling (MIM) independently, we propose joint masked vision and language modeling, where the masked signal of one modality is reconstructed with the help of the other modality. This is motivated by the nature of image-text paired data: the image and the text convey almost the same information, but in different formats. Reconstructing the masked signal of one modality conditioned on the other modality can also implicitly learn cross-modal alignment between language tokens and image patches. Our experiments on various V+L tasks show that the proposed method, along with common V+L alignment losses, achieves state-of-the-art performance in the regime of millions of pre-training samples. It also outperforms competing methods by a significant margin in limited-data scenarios.
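To make the joint masking idea concrete, the following is a minimal sketch of how the two training inputs could be assembled from one image-text pair: one input masks image patches while keeping the text intact, and the other masks text tokens while keeping the image intact. This is an illustration under stated assumptions, not the authors' implementation. The function names, the placeholder values, and the masking ratios (0.6 for patches, 0.3 for tokens) are hypothetical choices made for this example, and the transformer that would perform the reconstruction is omitted.

```python
import random

# Hypothetical placeholders for this sketch only.
MASK_TOKEN = "[MASK]"   # stands in for a masked text token
MASKED_PATCH = None     # stands in for a masked image patch


def mask_sequence(items, ratio, placeholder, rng):
    """Replace a random subset of `items` with `placeholder`.

    Returns the masked copy and the sorted indices that were masked.
    """
    n_mask = min(len(items), max(1, round(len(items) * ratio)))
    idx = sorted(rng.sample(range(len(items)), n_mask))
    masked = list(items)
    for i in idx:
        masked[i] = placeholder
    return masked, idx


def make_joint_masking_pairs(patches, tokens,
                             image_ratio=0.6, text_ratio=0.3, seed=0):
    """Build the two cross-modal reconstruction inputs for one pair.

    The ratios are assumptions chosen for illustration.
    """
    rng = random.Random(seed)
    masked_patches, patch_idx = mask_sequence(
        patches, image_ratio, MASKED_PATCH, rng)
    masked_tokens, token_idx = mask_sequence(
        tokens, text_ratio, MASK_TOKEN, rng)
    return {
        # Reconstruct masked patches conditioned on the intact text: p(I|T).
        "image_from_text": {
            "inputs": (masked_patches, list(tokens)),
            "targets": {i: patches[i] for i in patch_idx},
        },
        # Reconstruct masked tokens conditioned on the intact image: p(T|I).
        "text_from_image": {
            "inputs": (list(patches), masked_tokens),
            "targets": {i: tokens[i] for i in token_idx},
        },
    }


if __name__ == "__main__":
    patches = [f"patch_{i}" for i in range(10)]   # toy stand-ins for patches
    tokens = "a dog runs on the grass".split()
    pairs = make_joint_masking_pairs(patches, tokens)
    print(pairs["text_from_image"]["inputs"][1])
```

The point of the sketch is that the unmasked modality is always passed through complete, so a model trained on these inputs has to draw on it to fill in the masked signal of the other modality.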
