How Vision-Language Tasks Benefit from Large Pre-trained Models: A Survey
Yayun Qi, Hongxi Li, Yiqi Song, Xinxiao Wu, Jiebo Luo
TL;DR
This survey analyzes how large pre-trained models enhance vision-language tasks, organizing methods around four core challenges: data scarcity, escalating reasoning complexity, generalization to novel samples, and task diversity. It covers two families of pre-trained models, large language models (LLMs) and vision-language models (VLMs), and catalogs a spectrum of paradigms, including direct inference, learning from unlabeled data, pseudo data generation, divide-and-conquer reasoning, chain-of-thought, semantic-context extraction from LLMs, teacher-student distillation from VLMs, continual learning, and planning with natural language or code. Key findings show that pre-trained models enable zero-shot and few-shot capabilities, improve open-vocabulary understanding, and support modular, planner-driven systems, while also introducing risks such as hallucination, outdated knowledge, concept bias, and compositional confusion that require mitigation. The practical impact lies in providing a roadmap for researchers to leverage pre-trained models for diverse vision-language tasks and to design robust, scalable, and interpretable systems. The paper also highlights future directions such as knowledge integration, verification mechanisms, and unified modular frameworks to handle task diversity and evolving knowledge.
Abstract
The exploration of vision-language tasks, such as visual captioning, visual question answering, and visual commonsense reasoning, is an important area of artificial intelligence that continues to attract the research community's attention. Despite improvements in overall performance, classic challenges persist in vision-language tasks and hinder progress in this area. In recent years, the rise of pre-trained models has been driving research on vision-language tasks. Thanks to the massive scale of their training data and model parameters, pre-trained models have exhibited excellent performance on numerous downstream tasks. Inspired by these capabilities, new paradigms have emerged to address the classic challenges, and such methods have become mainstream in current research, drawing increasing attention and advancing rapidly. In this paper, we present a comprehensive overview of how vision-language tasks benefit from pre-trained models. First, we review several main challenges in vision-language tasks and discuss the limitations of the solutions proposed before the era of pre-training. Next, we summarize recent advances in incorporating pre-trained models to address these challenges. Finally, we analyze the potential risks arising from the inherent limitations of pre-trained models and discuss possible solutions, with the aim of suggesting future research directions.
