Keeping Yourself is Important in Downstream Tuning Multimodal Large Language Model

Wenke Huang; Jian Liang; Xianda Guo; Yiyang Fang; Guancheng Wan; Xuankun Rong; Chi Wen; Zekun Shi; Qingyun Li; Didi Zhu; Yanbiao Ma; Ke Liang; Bin Yang; He Li; Jiawei Shao; Mang Ye; Bo Du

TL;DR

This survey analyzes downstream tuning of Multimodal Large Language Models (MLLMs) with a focus on Task-Expert Specialization ($E$) and Open-World Stabilization ($F$). It introduces a three-way taxonomy (Selective Tuning, Additive Tuning, and Reparameterization Tuning) and uses $O=E+F$ to jointly assess performance. Benchmark results across architectures such as LLaVA-OV and VILA on datasets such as OKVQA, TextVQA, PathVQA, and ScienceQA reveal trade-offs between specialization and forgetting and yield principled tuning guidelines. A public repository tracks developments and standardizes evaluation to accelerate practical deployment in real-world domains.

Abstract

Multimodal Large Language Models (MLLMs) integrate visual and linguistic reasoning to address complex tasks such as image captioning and visual question answering. While MLLMs demonstrate remarkable versatility, they exhibit limited performance in specialized applications. However, tuning MLLMs for downstream tasks encounters two key challenges: Task-Expert Specialization, where distribution shifts between pre-training and target datasets constrain target performance, and Open-World Stabilization, where catastrophic forgetting erases the model's general knowledge. In this work, we systematically review recent advancements in MLLM tuning methodologies, classifying them into three paradigms: (I) Selective Tuning, (II) Additive Tuning, and (III) Reparameterization Tuning. Furthermore, we benchmark these tuning strategies across popular MLLM architectures and diverse downstream tasks to establish a standardized evaluation and systematic tuning principles. Finally, we highlight several open challenges in this domain and propose future research directions. To facilitate ongoing progress in this rapidly evolving field, we provide a public repository that continuously tracks developments: https://github.com/WenkeHuang/Awesome-MLLM-Tuning.
