Re-Imagining Multimodal Instruction Tuning: A Representation View

Yiyang Liu, James Chenhao Liang, Ruixiang Tang, Yugyung Lee, Majid Rabbani, Sohail Dianat, Raghuveer Rao, Lifu Huang, Dongfang Liu, Qifan Wang, Cheng Han

TL;DR

The paper tackles the challenge of zero-shot generalization in large multimodal models without incurring the heavy costs of full fine-tuning. It introduces Multimodal Representation Tuning (MRT), a framework that freezes the backbone and learns lightweight, low-rank representation editors to modify visual, cross-modality, and textual multimodal representations, enabling substantial performance gains with minimal parameter updates. MRT achieves state-of-the-art results on the MME benchmark, approaches full fine-tuning performance with only a fraction of trainable parameters, and demonstrates token-level controllability for robust, counterfactual manipulation of outputs. The work contributes a principled, interpretable approach to multimodal PEFT, provides extensive ablations on rank and editing strategies, and opens avenues for safer, more controllable multimodal systems in practical settings.

Abstract

Multimodal instruction tuning has proven to be an effective strategy for achieving zero-shot generalization by fine-tuning pre-trained Large Multimodal Models (LMMs) with instruction-following data. However, as the scale of LMMs continues to grow, fully fine-tuning these models has become highly parameter-intensive. Although Parameter-Efficient Fine-Tuning (PEFT) methods have been introduced to reduce the number of tunable parameters, a significant performance gap remains compared to full fine-tuning. Furthermore, existing PEFT approaches are often highly parameterized, making them difficult to interpret and control. In light of this, we introduce Multimodal Representation Tuning (MRT), a novel approach that focuses on directly editing semantically rich multimodal representations to achieve strong performance and provide intuitive control over LMMs. Empirical results show that our method surpasses current state-of-the-art baselines with significant performance gains (e.g., 1580.40 MME score) while requiring substantially fewer tunable parameters (e.g., 0.03% parameters). Additionally, we conduct experiments on editing instrumental tokens within multimodal representations, demonstrating that direct manipulation of these representations enables simple yet effective control over network behavior.
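
To make "editing representations" concrete, the sketch below applies one common form of low-rank representation edit, the LoReFT-style update h + Rᵀ(Wh + b − Rh), to a single hidden vector. This is an illustrative assumption: the abstract does not state the exact form of MRT's editors. The names `low_rank_edit` and `editor_param_count` and the toy dimensions are hypothetical. The frozen backbone is represented only by the hidden vector `h` it would produce; `R`, `W`, and `b` stand in for the trainable editor.

```python
# Minimal sketch of a low-rank representation editor (assumed LoReFT-style form,
# not necessarily the paper's exact editor). Pure Python, no dependencies.

def matvec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def low_rank_edit(h, R, W, b):
    """Return h + R^T (W h + b - R h).

    h: hidden vector of size d (output of the frozen model)
    R: r x d projection onto the edited subspace (trainable)
    W: r x d linear map giving target subspace coordinates (trainable)
    b: bias of size r (trainable)
    """
    target = [p + q for p, q in zip(matvec(W, h), b)]  # desired coordinates, size r
    current = matvec(R, h)                             # current coordinates, size r
    delta = [t - c for t, c in zip(target, current)]   # change inside the subspace
    # Map the r-dimensional change back to d dimensions with R^T.
    update = [sum(R[k][i] * delta[k] for k in range(len(R))) for i in range(len(h))]
    return [x + u for x, u in zip(h, update)]

def editor_param_count(d, r):
    """Trainable parameters of one editor: R (r*d) + W (r*d) + b (r)."""
    return 2 * d * r + r

# Toy example: d = 4, r = 2; R selects the first two coordinates.
h = [5.0, 6.0, 7.0, 8.0]
R = [[1.0, 0.0, 0.0, 0.0],
     [0.0, 1.0, 0.0, 0.0]]
W = [[0.0] * 4, [0.0] * 4]
b = [1.0, 2.0]
edited = low_rank_edit(h, R, W, b)
```

In this toy setting the edit overwrites only the two coordinates spanned by `R` and leaves the rest of `h` untouched. The same locality is what makes such an edit cheap: each editor adds on the order of 2·d·r parameters, while the backbone weights stay fixed.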

Paper Structure

This paper contains 30 sections, 4 equations, 9 figures, 14 tables.

Figures (9)

  • Figure 1: MRT (ours) vs. concurrent arts. Our method yields significant performance gains over state-of-the-art multimodal PEFT approaches on MME and MMAvg benchmarks with considerably lower parameter usage (see Table \ref{table:main-results}).
  • Figure 2: Overview of MRT. Representation editors $\psi \in \left\{\psi_V, \psi_c, \psi_P, \psi_S\right\}$ are the only tunable parameters, while the rest of the model remains completely frozen. During fine-tuning, we jointly edit the visual representations in the vision encoder, the representations in the cross-modality layer, and the prefix and suffix of the text-oriented portion of the multimodal representations in the LLM. These editors efficiently and effectively optimize the model representations during multimodal instruction tuning.
  • Figure 3: Controllability Pipeline on Image Classification. MRT offers LMM controllability from a representation perspective, allowing for direct editing of representations with semantic meanings and enabling counterfactual interference with the results. Details are shown in §\ref{subsec:Controllability_Experiment}.
  • Figure 4: Impact of Rank. Each cell in the map corresponds to the evaluation score of a model with a multimodal rank (row) and a visual rank (column). A darker hue represents a higher score, whereas a lighter hue indicates a lower score.
  • Figure 5: Loss Landscape along two random directions. The top three surfaces represent the loss landscape, while the bottom three are the 2-d heat maps.
  • ...and 4 more figures