Masked Vision and Language Modeling for Multi-modal Representation Learning

Gukyeong Kwon, Zhaowei Cai, Avinash Ravichandran, Erhan Bas, Rahul Bhotika, Stefano Soatto

TL;DR

This work addresses vision-language (V+L) representation learning by introducing MaskVLM, a joint masked modeling approach that reconstructs the masked signal of one modality with the help of the other and learns cross-modal alignment via image-text contrastive (ITC) and image-text matching (ITM) losses. It models both p(T|I) and p(I|T) in an end-to-end transformer framework, backed by a probabilistic interpretation based on variation of information. Empirically, MaskVLM achieves state-of-the-art results on image-text retrieval and competitive performance on visual question answering (VQA), NLVR2, and visual entailment (VE) with ~4M pre-training data, while showing strong data efficiency in limited-data regimes. The approach removes the reliance on frozen detectors or tokenizers, enabling end-to-end V+L interaction and yielding robust cross-modal representations for diverse tasks.
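
The probabilistic reading mentioned above can be made concrete with the textbook definition of variation of information. This is only a sketch: the symbols q_theta and q_phi for the learned reconstruction models are notation assumed here for illustration, and the paper's own formulation may differ in detail.

```latex
\begin{align}
\mathrm{VI}(I;T) &= H(I \mid T) + H(T \mid I) \\
&= -\mathbb{E}_{p(I,T)}\big[\log p(I \mid T)\big] - \mathbb{E}_{p(I,T)}\big[\log p(T \mid I)\big] \\
&\le -\mathbb{E}_{p(I,T)}\big[\log q_\theta(I \mid T)\big] - \mathbb{E}_{p(I,T)}\big[\log q_\phi(T \mid I)\big]
\end{align}
```

The last line follows because a cross-entropy is never smaller than the corresponding conditional entropy, so minimizing the two reconstruction losses minimizes an upper bound on the variation of information between image and text.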

Abstract

In this paper, we study how to use masked signal modeling in vision and language (V+L) representation learning. Instead of developing masked language modeling (MLM) and masked image modeling (MIM) independently, we propose to build joint masked vision and language modeling, where the masked signal of one modality is reconstructed with the help of the other modality. This is motivated by the nature of image-text paired data: the image and the text convey almost the same information, but in different formats. The masked signal reconstruction of one modality conditioned on the other can also implicitly learn cross-modal alignment between language tokens and image patches. Our experiments on various V+L tasks show that the proposed method, along with common V+L alignment losses, achieves state-of-the-art performance in the regime of millions of pre-training data. Our method also outperforms other competitors by a significant margin in limited-data scenarios.
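
One of the common V+L alignment losses referred to above is the image-text contrastive (ITC) loss. The following is a minimal NumPy sketch of a widely used symmetric formulation, not the authors' implementation: the function names, the temperature of 0.07, and the use of in-batch negatives are assumptions made for illustration.

```python
import numpy as np


def log_softmax(x, axis):
    """Numerically stable log-softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))


def itc_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired embeddings.

    Row i of image_emb and row i of text_emb are assumed to form a
    matching pair; all other rows in the batch act as negatives.
    The temperature value is an assumption, not taken from the paper.
    """
    # L2-normalize so the dot product is a cosine similarity.
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)

    # logits[i, j] = similarity between image i and text j.
    logits = image_emb @ text_emb.T / temperature
    diag = np.arange(logits.shape[0])

    # Image-to-text: softmax over texts for each image (rows).
    loss_i2t = -log_softmax(logits, axis=1)[diag, diag].mean()
    # Text-to-image: softmax over images for each text (columns).
    loss_t2i = -log_softmax(logits, axis=0)[diag, diag].mean()
    return 0.5 * (loss_i2t + loss_t2i)
```

With well-aligned pairs the diagonal of the similarity matrix dominates and the loss approaches zero; with mismatched pairs the loss grows, which is what pushes the two encoders toward a shared embedding space.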

Paper Structure

This paper contains 24 sections, 3 equations, 8 figures, and 9 tables.

Figures (8)

  • Figure 1: An overview of masked vision and language modeling. The left side shows existing approaches and the right side highlights our proposed approach.
  • Figure 2: A framework of joint modeling of masked vision and language. The blue and green lines demonstrate the information flow for image and text reconstruction, respectively. The dotted lines indicate the cross-modal input of unmasked signals for generating attention.
  • Figure 3: Visualization of image (text) encoders and image (text) cross-modality encoders.
  • Figure 4: R@1 plots for image retrieval (left) and text retrieval (right) on COCO using limited pre-training data.
  • Figure 5: Masked language modeling examples using masked and original images. "Recon (mask)" and "Recon (org)" denote reconstructed text from the masked image and the original image, respectively.
  • ...and 3 more figures