How Vision-Language Tasks Benefit from Large Pre-trained Models: A Survey

Yayun Qi, Hongxi Li, Yiqi Song, Xinxiao Wu, Jiebo Luo

TL;DR

This survey analyzes how large pre-trained models enhance vision-language tasks, organizing methods around four core challenges: data scarcity, escalating reasoning complexity, generalization to novel samples, and task diversity. It covers two families of pre-trained models, large language models (LLMs) and vision-language models (VLMs), and catalogs the paradigms built on them: direct inference, learning from unlabeled data, pseudo data generation, divide-and-conquer reasoning, chain-of-thought, semantic-context extraction from LLMs, teacher-student distillation from VLMs, continual learning, and planning with natural language or code. The survey finds that pre-trained models enable zero-shot and few-shot capabilities, improve open-vocabulary understanding, and support modular, planner-driven systems, but also introduce risks that require mitigation: hallucination, outdated knowledge, concept bias, and compositional confusion. In practice, it offers researchers a roadmap for applying pre-trained models to diverse vision-language tasks and for designing robust, scalable, and interpretable systems. It also highlights future directions such as knowledge integration, verification mechanisms, and unified modular frameworks that handle task diversity and evolving knowledge.

Abstract

The exploration of various vision-language tasks, such as visual captioning, visual question answering, and visual commonsense reasoning, is an important area in artificial intelligence and continuously attracts the research community's attention. Despite improvements in overall performance, classic challenges still exist in vision-language tasks and hinder the development of this area. In recent years, the rise of pre-trained models has been driving research on vision-language tasks. Thanks to the massive scale of their training data and model parameters, pre-trained models have exhibited excellent performance in numerous downstream tasks. Inspired by the powerful capabilities of pre-trained models, new paradigms have emerged to solve the classic challenges. Such methods have become mainstream in current research, with increasing attention and rapid advances. In this paper, we present a comprehensive overview of how vision-language tasks benefit from pre-trained models. First, we review several main challenges in vision-language tasks and discuss the limitations of previous solutions before the era of pre-training. Next, we summarize the recent advances in incorporating pre-trained models to address the challenges in vision-language tasks. Finally, we analyze the potential risks associated with the inherent limitations of pre-trained models and discuss possible solutions, attempting to provide future research directions.

Paper Structure

This paper contains 32 sections, 5 figures, 7 tables.

Figures (5)

  • Figure 1: An illustration of four classic challenges in vision-language tasks.
  • Figure 2: Paradigms for addressing the data scarcity challenge in vision-language tasks with the help of pre-trained models. (a) shows the paradigm of integrating an LLM with CLIP to perform direct inference on the test sample. (b) shows the paradigm of converting visual information into texts by a VLM to perform direct inference on the test sample. (c) shows the paradigm of learning from unlabeled uni-modal data. (d) shows the paradigm of using a VLM to generate pseudo paired data for training or evaluation.
  • Figure 3: Paradigms of using pre-trained models to conquer the challenge of escalating reasoning complexity in vision-language tasks. (a) shows the basic idea of the divide-and-conquer solution. (b) shows the chain-of-thought solution, which includes two different pipelines for linguistic and visual perspectives.
  • Figure 4: Paradigms of using pre-trained models to conquer the generalization challenge to novel samples in vision-language tasks. (a) shows the basic idea of extracting semantic context from an LLM, while (b) shows the basic idea of distilling teacher knowledge from a VLM.
  • Figure 5: Paradigms of using pre-trained models to conquer the task diversity challenge in vision-language tasks. (a) shows the basic idea of applying continual learning strategies on a single VLM to enable it to address different vision-language tasks. (b) and (c) show the basic idea of building a general modular system that plans with natural language and code statements, respectively, to address different vision-language tasks.
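
As a rough illustration of paradigm (b) in Figure 2, where a VLM converts visual information into text and inference is then run directly on the test sample, the data flow might be sketched as below. This is a minimal sketch under stated assumptions, not a method from the survey: it assumes the text is passed to an LLM, and every name in it (`build_prompt`, `answer_from_caption`, `toy_captioner`, `toy_llm`) is hypothetical. The two `toy_*` functions are hard-coded stand-ins so the example runs without any model or external library.

```python
from typing import Callable


def build_prompt(caption: str, question: str) -> str:
    """Combine a caption and a question into a text-only prompt (hypothetical format)."""
    return (
        f"Image description: {caption.strip()}\n"
        f"Question: {question.strip()}\n"
        "Answer:"
    )


def answer_from_caption(
    image: object,
    question: str,
    captioner: Callable[[object], str],
    llm: Callable[[str], str],
) -> str:
    """Zero-shot inference: describe the image as text, then answer from the text alone."""
    caption = captioner(image)            # stands in for a VLM captioning call
    prompt = build_prompt(caption, question)
    return llm(prompt).strip()            # stands in for an LLM completion call


def toy_captioner(image: object) -> str:
    # Assumption: a fixed caption replaces a real vision-language model.
    return "a brown dog catching a red frisbee in a park"


def toy_llm(prompt: str) -> str:
    # Assumption: a trivial rule replaces a real language model.
    return "red" if "red frisbee" in prompt else "unknown"


if __name__ == "__main__":
    print(answer_from_caption(None, "What color is the frisbee?", toy_captioner, toy_llm))
```

Passing the captioner and the language model in as plain callables keeps the sketch independent of any particular model library. No labeled training pairs are used anywhere in this flow, which is the point of the direct-inference paradigm for data scarcity.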