Efficiently Integrate Large Language Models with Visual Perception: A Survey from the Training Paradigm Perspective

Xiaorui Ma, Haoran Xie, S. Joe Qin

TL;DR

This survey addresses the challenge of efficiently integrating visual perception into large language models (LLMs). It organizes existing approaches into three training paradigms (Single-stage Tuning, Two-stage Tuning, and Direct Adaptation) and reviews 34 Vision Large Language Models (VLLMs). It examines LLM architectures, vision encoders, and modality integrators (MIs), emphasizing parameter-efficient adaptation via prompt-based, adapter-based, and LoRA-based techniques, alongside learning paradigms such as Multi-task Learning, Instruction Tuning, and RLHF. The analysis shows that Two-stage Tuning generally yields the strongest performance, aided by instruction tuning and reparameterization, while Direct Adaptation excels in parameter and memory efficiency, especially when using compact LLMs and efficient MI designs. The work offers guidance for researchers and practitioners on selecting training paradigms and MI architectures for effective and scalable multimodal reasoning in real-world deployments.

Abstract

The integration of vision-language modalities has been a significant focus in multimodal learning, traditionally relying on Vision-Language Pretrained Models. However, with the advent of Large Language Models (LLMs), there has been a notable shift towards combining LLMs with vision modalities. Following this, the training paradigms for incorporating vision modalities into LLMs have evolved. Initially, the approach was to integrate the modalities by pretraining the modality integrator, termed Single-stage Tuning. It has since branched out into methods focusing on performance enhancement, denoted as Two-stage Tuning, and those prioritizing parameter efficiency, referred to as Direct Adaptation. However, existing surveys primarily address the latest Vision Large Language Models (VLLMs) with Two-stage Tuning, leaving a gap in understanding the evolution of training paradigms and their distinct parameter-efficiency considerations. This paper categorizes and reviews 34 VLLMs from top conferences, journals, and highly cited arXiv papers, focusing on parameter efficiency during adaptation from the training paradigm perspective. We first introduce the architecture of LLMs and parameter-efficient learning methods, followed by a discussion of vision encoders and a comprehensive taxonomy of modality integrators. We then review the three training paradigms and their efficiency considerations, and summarize benchmarks in the VLLM field. To gain deeper insight into their parameter efficiency, we compare and discuss the experimental results of representative models, and we replicate the experiment for the Direct Adaptation paradigm. By providing insights into recent developments and practical uses, this survey serves as a guide for researchers and practitioners navigating the efficient integration of vision modalities into LLMs.

Paper Structure

This paper contains 35 sections, 27 equations, 3 figures, 11 tables.

Figures (3)

  • Figure 1: Integrated Modules and Three Training Paradigms. MI denotes modality integrator, and VE denotes vision encoder. The trainable modules and learning paradigms shown are the most commonly adopted ones.
  • Figure 2: The Taxonomy and Publishing Time. AC denotes the annual citation count. For published work, the horizontal axis shows the publication date, while for unpublished work, it shows the date of submission to arXiv.
  • Figure 3: Taxonomy of Modality Integrator.