EVE: Efficient Vision-Language Pre-training with Masked Prediction and Modality-Aware MoE

Junyi Chen, Longteng Guo, Jia Sun, Shuai Shao, Zehuan Yuan, Liang Lin, Dongyu Zhang

Abstract

Building scalable vision-language models to learn from diverse, multimodal data remains an open challenge. In this paper, we introduce an Efficient Vision-languagE foundation model, namely EVE, which is one unified multimodal Transformer pre-trained solely by one unified pre-training task. Specifically, EVE encodes both vision and language within a shared Transformer network integrated with modality-aware sparse Mixture-of-Experts (MoE) modules, which capture modality-specific information by selectively switching to different experts. To unify pre-training tasks of vision and language, EVE performs masked signal modeling on image-text pairs to reconstruct masked signals, i.e., image pixels and text tokens, given visible signals. This simple yet effective pre-training objective accelerates training by 3.5x compared to the model pre-trained with Image-Text Contrastive and Image-Text Matching losses. Owing to the combination of the unified architecture and pre-training task, EVE is easy to scale up, enabling better downstream performance with fewer resources and faster training speed. Despite its simplicity, EVE achieves state-of-the-art performance on various vision-language downstream tasks, including visual question answering, visual reasoning, and image-text retrieval.
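
The masked signal modeling setup described in the abstract can be illustrated with a small sketch. It only shows how the two training inputs might be built from one image-text pair: a masked image with the complete text, and a complete image with masked text. The helper names (`random_mask`, `build_training_views`), the `[MASK]` placeholder, and the masking ratios are assumptions made for illustration, not values or APIs taken from the paper, and the reconstruction model itself is omitted.

```python
import random

MASK = "[MASK]"  # hypothetical placeholder standing in for a masked signal


def random_mask(signals, ratio, rng):
    """Replace a random `ratio` of positions with MASK.

    Returns the masked copy and the sorted list of masked positions,
    which are the positions a model would be asked to reconstruct.
    """
    n_masked = round(len(signals) * ratio)
    positions = sorted(rng.sample(range(len(signals)), n_masked))
    masked = list(signals)
    for i in positions:
        masked[i] = MASK
    return masked, positions


def build_training_views(patches, tokens, image_ratio, text_ratio, seed=0):
    """Build two inputs from one image-text pair.

    View 1: masked image + complete text (reconstruct image signals).
    View 2: complete image + masked text (reconstruct text tokens).
    """
    rng = random.Random(seed)
    masked_patches, patch_targets = random_mask(patches, image_ratio, rng)
    masked_tokens, token_targets = random_mask(tokens, text_ratio, rng)
    return {
        "masked_image_modeling": {
            "image": masked_patches,
            "text": list(tokens),
            "targets": patch_targets,
        },
        "masked_language_modeling": {
            "image": list(patches),
            "text": masked_tokens,
            "targets": token_targets,
        },
    }


# Toy pair: 8 image patches and 5 text tokens; ratios are illustrative only.
patches = [f"patch_{i}" for i in range(8)]
tokens = ["a", "dog", "runs", "on", "grass"]
views = build_training_views(patches, tokens, image_ratio=0.75, text_ratio=0.4)
```

In each view, only one modality is corrupted, so the visible signals of the other modality remain available as context for reconstruction.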

Paper Structure (50 sections, 10 equations, 9 figures, 13 tables)


Figures (9)

  • Figure 1: Performance of different models on VQA test-dev under different training hours. Training hours of all models are reproduced by us on A100 GPUs.
  • Figure 2: Overview of EVE and Masked Signal Modeling. EVE uses a unified architecture with shared attention and Modality-Aware MoE, and is pre-trained with a single unified masked signal modeling task. We apply random masking to both image and text. Masked image modeling takes the masked image and the complete text as input, and masked language modeling takes the complete image and the masked text.
  • Figure 3: Architecture of Modality-Aware MoE.
  • Figure 4: Ablation study on masking ratio. The left and right y-axes denote VQA accuracy and Flickr mean recall, respectively.
  • Figure 5: Ablation study on the number of experts and top-$k$ design. We use a soft router in Transformer blocks [8, 10, 12].
  • ...and 4 more figures
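
The top-$k$ expert selection mentioned in the Figure 5 caption can be sketched as follows. This is a generic top-$k$ soft-routing example, not the paper's implementation: the function names (`softmax`, `top_k_route`, `moe_output`), the scalar toy experts, and the renormalization of the selected weights are all assumptions, and the way EVE makes its router modality-aware is not specified in this excerpt and is therefore not modeled.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_route(router_logits, k):
    """Pick the k highest-probability experts.

    Returns a list of (expert_index, weight) pairs whose weights are
    renormalized to sum to 1 (an assumed design choice).
    """
    probs = softmax(router_logits)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in chosen)
    return [(i, probs[i] / norm) for i in chosen]


def moe_output(token, experts, router_logits, k):
    """Combine the outputs of the selected experts, weighted by the router."""
    routes = top_k_route(router_logits, k)
    return sum(weight * experts[i](token) for i, weight in routes)


# Toy experts acting on a scalar "token"; real experts would be feed-forward networks.
experts = [lambda x: x + 1.0, lambda x: 2.0 * x, lambda x: -x]
routes = top_k_route([2.0, 0.0, 2.0], k=2)
output = moe_output(3.0, experts, [2.0, 0.0, 2.0], k=2)
```

With $k = 1$ the same code reduces to sending each token to a single expert, which is the usual way the number of active experts per token is varied in an ablation like this one.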