Vision-Language Pre-training: Basics, Recent Advances, and Future Trends

Zhe Gan, Linjie Li, Chunyuan Li, Lijuan Wang, Zicheng Liu, Jianfeng Gao

TL;DR

This survey maps the rapid development of Vision-Language Pre-training (VLP) across image-text, core vision, and video-text tasks, highlighting a shift from task-specific designs to large-scale pre-training and unified, foundation-model-style approaches. It details architectures (vision/text encoders, fusion strategies), pre-training objectives (masked language modeling (MLM), image-text matching (ITM), image-text contrastive learning (ITC), and masked image modeling (MIM)), and data sources, while discussing advanced topics like big models, in-context learning, memory-efficient adaptation, and multilinguality. The paper emphasizes the transition to open-set and in-the-wild capabilities enabled by language supervision, and discusses industrial deployment considerations such as domain adaptation, cost, and fairness. It also outlines a roadmap toward general-purpose multimodal foundation models through unified modeling, robust evaluation, and knowledge integration, with a forward-looking view on text-to-image (T2I) generation, multi-channel video understanding, and cross-modal collaboration. Overall, VLP stands to substantially broaden the applicability and robustness of multimodal AI by leveraging language as a universal supervision signal and a flexible interface for open-ended tasks.

Abstract

This paper surveys vision-language pre-training (VLP) methods for multimodal intelligence that have been developed in the last few years. We group these approaches into three categories: ($i$) VLP for image-text tasks, such as image captioning, image-text retrieval, visual question answering, and visual grounding; ($ii$) VLP for core computer vision tasks, such as (open-set) image classification, object detection, and segmentation; and ($iii$) VLP for video-text tasks, such as video captioning, video-text retrieval, and video question answering. For each category, we present a comprehensive review of state-of-the-art methods, and discuss the progress that has been made and challenges still being faced, using specific systems and models as case studies. In addition, for each category, we discuss advanced topics being actively explored in the research community, such as big foundation models, unified modeling, in-context few-shot learning, knowledge, robustness, and computer vision in the wild, to name a few.

Paper Structure (142 sections, 21 equations, 41 figures, 5 tables)

This paper contains 142 sections, 21 equations, 41 figures, 5 tables.

Figures (41)

  • Figure 1: Overview of the paper structure, detailing Chapter \ref{chp:basics} to Chapter \ref{chp:vlp4videotxt}.
  • Figure 2: Illustration of representative tasks from three categories of VL problems covered in this paper: image-text tasks, vision tasks as VL problems, and video-text tasks.
  • Figure 3: The transition from task-specific methods to large-scale pre-training, using the VQA task as a case study. Each transition brought a large performance lift, e.g., from MCAN (yu2019deep) to UNITER (chen2020uniter), and from ALBEF (li2021align) to SimVLM (wang2021simvlm). Methods before August 2017 are not drawn, and only some representative VLP works are shown to keep the figure from becoming too crowded.
  • Figure 4: Illustration of representative vision-language tasks with image-text inputs: ($i$) image-text retrieval; ($ii$) visual question answering and visual reasoning; and ($iii$) image captioning with a single-sentence caption, or a more descriptive paragraph of captions.
  • Figure 5: Illustration of a general framework for task-specific VQA models. In most cases, image features are extracted offline, with no gradient update to the visual encoder during model training.
  • ...and 36 more figures