Vision-Language Pre-training: Basics, Recent Advances, and Future Trends
Zhe Gan, Linjie Li, Chunyuan Li, Lijuan Wang, Zicheng Liu, Jianfeng Gao
TL;DR
This survey maps the rapid development of Vision-Language Pre-training (VLP) across image-text, core vision, and video-text tasks, highlighting a shift from task-specific designs to large-scale pre-training and unified, foundation-model-style approaches. It details architectures (vision and text encoders, fusion strategies), pre-training objectives (masked language modeling (MLM), image-text matching (ITM), image-text contrastive learning (ITC), and masked image modeling (MIM)), and data sources, and it covers advanced topics such as big models, in-context learning, memory-efficient adaptation, and multilinguality. The paper emphasizes the transition to open-set and in-the-wild capabilities enabled by language supervision, and discusses industrial deployment considerations such as domain adaptation, cost, and fairness. It also outlines a roadmap toward general-purpose multimodal foundation models through unified modeling, robust evaluation, and knowledge integration, with a forward-looking view on text-to-image (T2I) generation, multi-channel video understanding, and cross-modal collaboration. Overall, VLP stands to broaden the applicability and robustness of multimodal AI by using language as a universal supervision signal and a flexible interface for open-ended tasks.
Abstract
This paper surveys vision-language pre-training (VLP) methods for multimodal intelligence that have been developed in the last few years. We group these approaches into three categories: (i) VLP for image-text tasks, such as image captioning, image-text retrieval, visual question answering, and visual grounding; (ii) VLP for core computer vision tasks, such as (open-set) image classification, object detection, and segmentation; and (iii) VLP for video-text tasks, such as video captioning, video-text retrieval, and video question answering. For each category, we present a comprehensive review of state-of-the-art methods, and discuss the progress that has been made and challenges still being faced, using specific systems and models as case studies. In addition, for each category, we discuss advanced topics being actively explored in the research community, such as big foundation models, unified modeling, in-context few-shot learning, knowledge, robustness, and computer vision in the wild, to name a few.
