Keeping Yourself is Important in Downstream Tuning Multimodal Large Language Model

Wenke Huang, Jian Liang, Xianda Guo, Yiyang Fang, Guancheng Wan, Xuankun Rong, Chi Wen, Zekun Shi, Qingyun Li, Didi Zhu, Yanbiao Ma, Ke Liang, Bin Yang, He Li, Jiawei Shao, Mang Ye, Bo Du

TL;DR

This survey analyzes downstream tuning of Multimodal Large Language Models (MLLMs) with a focus on Task-Expert Specialization ($E$) and Open-World Stabilization ($F$). It introduces a three-way taxonomy—Selective Tuning, Additive Tuning, and Reparameterization Tuning—and uses $O=E+F$ to jointly assess performance. Benchmark results across architectures like LLaVA-OV and VILA on datasets such as OKVQA, TextVQA, PathVQA, and ScienceQA reveal trade-offs between specialization and forgetting and yield principled tuning guidelines. A public repository tracks developments and standardizes evaluation to accelerate practical deployment in real-world domains.

Abstract

Multi-modal Large Language Models (MLLMs) integrate visual and linguistic reasoning to address complex tasks such as image captioning and visual question answering. While MLLMs demonstrate remarkable versatility, they exhibit limited performance on specialized applications. Tuning MLLMs for downstream tasks, however, encounters two key challenges: Task-Expert Specialization, where distribution shifts between pre-training and target datasets constrain target performance, and Open-World Stabilization, where catastrophic forgetting erases the model's general knowledge. In this work, we systematically review recent advancements in MLLM tuning methodologies, classifying them into three paradigms: (I) Selective Tuning, (II) Additive Tuning, and (III) Reparameterization Tuning. Furthermore, we benchmark these tuning strategies across popular MLLM architectures and diverse downstream tasks to establish a standardized evaluation analysis and systematic tuning principles. Finally, we highlight several open challenges in this domain and propose future research directions. To facilitate ongoing progress in this rapidly evolving field, we provide a public repository that continuously tracks developments: https://github.com/WenkeHuang/Awesome-MLLM-Tuning.

Paper Structure

This paper contains 38 sections, 12 equations, 3 figures, 7 tables.

Figures (3)

  • Figure 1: Overview of the survey. Best viewed in color.
  • Figure 2: Flow chart of the Multimodal Large Language Model tuning paradigm. Refer to \ref{sec:formulation} for details.
  • Figure 3: Performance comparison on both upstream and downstream tasks with or without the Vision Projector $\varphi$. Tuning the projector benefits targets with distinct distributions, e.g., PathVQA and RSVQA. Refer to \ref{sec:exp_com} for discussion.

Theorems & Definitions (2)

  • Definition 2.1: Specialization Improvement
  • Definition 2.2: Stabilization Forgetting