Semantics-enhanced Cross-modal Masked Image Modeling for Vision-Language Pre-training

Haowei Liu; Yaya Shi; Haiyang Xu; Chunfeng Yuan; Qinghao Ye; Chenliang Li; Ming Yan; Ji Zhang; Fei Huang; Bing Li; Weiming Hu

TL;DR

This work tackles fine-grained cross-modal semantic alignment in vision-language pre-training by addressing two limitations of prior masked image modeling (MIM): a lack of high-level semantic supervision and limited text involvement. It introduces SemMIM, a semantics-enhanced cross-modal MIM framework that learns high-level visual semantics from global image features using a momentum encoder and transfers them to local patch encodings via a shared encoding space. It further deepens text involvement through a text-guided masking strategy, fusion of text features during masked modeling, and injection of text semantics into momentum-encoded patch representations, using a learnable $K$-dimensional encoding space for MIM targets. On MSCOCO, Flickr30K, COCO Caption, and VQA benchmarks, SemMIM achieves state-of-the-art or competitive results compared with models of similar size and data scale, and ablations validate the benefits of local semantics and text-guided cross-modal modeling.
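To make the idea of MIM targets in a learnable $K$-dimensional encoding space concrete, here is a minimal NumPy sketch. It is an illustration under assumptions, not the authors' implementation: the names `prototypes`, `encode`, and `mim_loss`, the cosine-similarity projection, the softmax normalisation, and both temperature values are hypothetical choices that this summary does not specify.

```python
import numpy as np

def l2_normalize(x):
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def encode(features, prototypes, temperature):
    """Map D-dim features to a distribution over K dimensions (assumed design)."""
    logits = l2_normalize(features) @ l2_normalize(prototypes).T / temperature
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum(axis=-1, keepdims=True)

def mim_loss(online_patches, momentum_patches, prototypes, masked):
    """Cross-entropy between momentum-encoded targets and online predictions,
    averaged over masked patch positions only."""
    targets = encode(momentum_patches, prototypes, temperature=0.05)  # assumed value
    preds = encode(online_patches, prototypes, temperature=0.1)       # assumed value
    cross_entropy = -(targets * np.log(preds + 1e-12)).sum(axis=-1)
    return cross_entropy[masked].mean()

# Toy data: N patches with D-dim features and a K-dim encoding space.
rng = np.random.default_rng(0)
N, D, K = 16, 32, 8
prototypes = rng.normal(size=(K, D))        # hypothetical learnable parameters
momentum_patches = rng.normal(size=(N, D))  # stand-in for momentum encoder output
online_patches = momentum_patches + 0.1 * rng.normal(size=(N, D))
masked = np.zeros(N, dtype=bool)
masked[:6] = True                           # pretend the first six patches are masked

targets = encode(momentum_patches, prototypes, temperature=0.05)
loss = mim_loss(online_patches, momentum_patches, prototypes, masked)
```

In this sketch the same `prototypes` matrix encodes both the targets and the predictions, which is one simple way to read "sharing the encoding space"; the paper's actual mechanism may differ.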

Abstract

In vision-language pre-training (VLP), masked image modeling (MIM) has recently been introduced for fine-grained cross-modal alignment. However, in most existing methods, the reconstruction targets for MIM lack high-level semantics, and text is not sufficiently involved in masked modeling. These two drawbacks limit the effect of MIM in facilitating cross-modal semantic alignment. In this work, we propose a semantics-enhanced cross-modal MIM framework (SemMIM) for vision-language representation learning. Specifically, to provide more semantically meaningful supervision for MIM, we propose a local semantics enhancing approach, which harvests high-level semantics from global image features via self-supervised agreement learning and transfers them to local patch encodings by sharing the encoding space. Moreover, to achieve deep involvement of text during the entire MIM process, we propose a text-guided masking strategy and devise an efficient way of injecting textual information into both masked modeling and reconstruction target acquisition. Experimental results validate that our method improves the effectiveness of the MIM task in facilitating cross-modal semantic alignment. Compared to previous VLP models with similar model size and data scale, our SemMIM model achieves state-of-the-art or competitive performance on multiple downstream vision-language tasks.
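The abstract names a text-guided masking strategy without giving its details. One plausible reading, sketched below purely as an assumption, is to rank image patches by their similarity to a text embedding and mask the most text-relevant ones first. The function `text_guided_mask`, the cosine-similarity score, and the `mask_ratio` parameter are hypothetical and are not taken from the paper.

```python
import numpy as np

def text_guided_mask(patch_feats, text_feat, mask_ratio=0.3):
    """Return a boolean mask selecting the patches most similar to the text.

    Assumption: relevance is the cosine similarity between each patch feature
    and a single global text feature; the paper may score patches differently.
    """
    patches = patch_feats / np.linalg.norm(patch_feats, axis=-1, keepdims=True)
    text = text_feat / np.linalg.norm(text_feat)
    relevance = patches @ text                        # one score per patch
    num_masked = max(1, int(round(mask_ratio * len(relevance))))
    top = np.argsort(-relevance)[:num_masked]         # highest relevance first
    mask = np.zeros(len(relevance), dtype=bool)
    mask[top] = True
    return mask, relevance

# Toy example: four 2-D patch features and a text feature along the first axis.
patch_feats = np.array([[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]])
text_feat = np.array([1.0, 0.0])
mask, relevance = text_guided_mask(patch_feats, text_feat, mask_ratio=0.5)
print(mask.tolist())  # [True, False, True, False]
```

Masking the text-relevant patches forces the model to rely on the caption to reconstruct them, which is the intuition behind involving text in the masking step.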

Paper Structure (20 sections, 4 equations, 3 figures, 8 tables)

Introduction
Related Work
Vision-Language Pre-training
Masked Image Modeling in VLP
Method
Model Architecture
Semantics-enhanced Cross-modal Masked Image Modeling
Local Semantics Enhancing
Text-deeply-involved Cross-modal MIM
Pre-training Objectives
Experiments
Pre-training Datasets
Implementation Detail
Evaluation on Downstream Vision-Language Tasks
Ablation Study
...and 5 more sections

Figures (3)

Figure 1: Overview of our method. (a) shows the model architecture and pre-training objectives of our SemMIM framework. (b) illustrates the proposed local semantics enhancing approach, which harvests high-level semantics from global visual features via agreement learning and transfers them into patch encodings by sharing the same encoding space. (b) also shows our designs for deep involvement of text during MIM, including the text-guided masking strategy and the injection of textual information into both masked modeling and reconstruction target acquisition.
Figure 2: Pattern clusters of image patch encodings. The left six figures (our model) showcase high-level semantic patterns: the eye of an elephant, banana, bowtie, hand, cat, and computer. The right two figures (dVAE) showcase relatively low-level visual patterns: circle texture and yellow color.
Figure 3: Visualization of pattern layout of full images. Different patterns are shown in different colors.
