LLaVA Steering: Visual Instruction Tuning with 500x Fewer Parameters through Modality Linear Representation-Steering

Jinhe Bi, Yujun Wang, Haokun Chen, Xun Xiao, Artur Hecker, Volker Tresp, Yunpu Ma

TL;DR

This work addresses intrinsic modality imbalance in Multimodal Large Language Models by showing that text often dominates visual instruction tuning. It introduces Modality Linear Representation-Steering (MoReS), which steers visual representations in a reduced subspace via a linear transformation while keeping the LLM frozen, achieving comparable visual-task performance with orders of magnitude fewer trainable parameters ($O(Dd)$) than full fine-tuning ($O(D^2)$). The LLaVA Steering models (3B/7B/13B) demonstrate strong results across visual benchmarks and VQA tasks, with parameter-efficiency improvements ranging from 287x to 1150x relative to LoRA, and ablations confirm the effectiveness of a 1% steered-token ratio and a rank-1 subspace. To support the research community, the authors also release the LLaVA Steering Factory, a modular platform enabling standardized training, evaluation, and modality-imbalance analysis across diverse MLLMs. These contributions collectively offer a practical pathway to scalable, visually grounded language understanding with greatly reduced training overhead.

Abstract

Multimodal Large Language Models (MLLMs) have significantly advanced visual tasks by integrating visual representations into large language models (LLMs). The textual modality, inherited from LLMs, equips MLLMs with abilities like instruction following and in-context learning. In contrast, the visual modality enhances performance in downstream tasks by leveraging rich semantic content, spatial information, and grounding capabilities. These intrinsic modalities work synergistically across various visual tasks. Our research first reveals a persistent imbalance between these modalities, with text often dominating output generation during visual instruction tuning. This imbalance occurs with both full fine-tuning and parameter-efficient fine-tuning (PEFT) methods. We then find that re-balancing these modalities can significantly reduce the number of trainable parameters required, suggesting a direction for further optimizing visual instruction tuning. We introduce Modality Linear Representation-Steering (MoReS) to achieve this goal. MoReS re-balances the intrinsic modalities throughout the model; the key idea is to steer visual representations through linear transformations in the visual subspace at each model layer. To validate our solution, we composed LLaVA Steering, a suite of models integrated with the proposed MoReS method. Evaluation results show that the LLaVA Steering models require, on average, 500 times fewer trainable parameters than LoRA while still achieving comparable performance across three visual benchmarks and eight visual question-answering tasks. Finally, we present the LLaVA Steering Factory, an in-house developed platform with a component-based architecture that enables researchers to quickly customize various MLLMs, seamlessly integrate state-of-the-art models, and evaluate their intrinsic modality imbalance.
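
The idea of steering only the visual token representations inside a low-dimensional subspace, with the LLM weights frozen, can be illustrated with a short sketch. The exact MoReS formulation is not given in this summary, so the code below assumes a LoReFT-style edit, h + Rᵀ(Wh + b − Rh), where D is taken to be the hidden size and d the subspace dimension. The names `steer_visual`, `trainable_params`, `R`, `W`, and `b` are hypothetical and chosen only for illustration.

```python
import numpy as np

def steer_visual(h, visual_mask, R, W, b):
    """Apply a low-rank linear edit to visual token states only (assumed form).

    h:           (T, D) hidden states for one layer
    visual_mask: (T,) boolean, True where the token is a visual token
    R:           (d, D) projection with orthonormal rows (the steering subspace)
    W, b:        (d, D) and (d,) learned linear map giving the target subspace values
    """
    # Difference between the learned target and the current subspace coordinates,
    # mapped back into the D-dimensional hidden space.
    delta = (h @ W.T + b - h @ R.T) @ R          # (T, D)
    out = h.copy()
    out[visual_mask] += delta[visual_mask]       # text tokens are left untouched
    return out

def trainable_params(D, d):
    """Per-layer parameter counts: the steering edit (R, W, b) vs. a dense D x D update."""
    return {"steering": 2 * D * d + d, "dense_DxD": D * D}

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    T, D, d = 6, 16, 1                           # rank-1 subspace, as in the ablation
    h = rng.normal(size=(T, D))
    mask = np.array([True, True, True, False, False, False])
    R = np.linalg.qr(rng.normal(size=(D, d)))[0].T   # orthonormal rows
    W, b = rng.normal(size=(d, D)), np.zeros(d)
    steered = steer_visual(h, mask, R, W, b)
    print(np.allclose(steered[~mask], h[~mask]))     # text states unchanged
    print(trainable_params(4096, 1))
```

Under this assumed form, the edit costs on the order of Dd parameters per layer, compared with D² for a dense update of the same layer, which is the scaling contrast stated in the TL;DR.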

Paper Structure

This paper contains 22 sections, 15 equations, 8 figures, 9 tables.

Figures (8)

  • Figure 1: Left: Attention score distributions across layers for three MLLM fine-tuning methods (Full, LoRA, and MoReS), sampled from 100 instances each. Green represents visual representations, while grey indicates other (primarily textual) representations. Full fine-tuning and LoRA show strong reliance on textual representations across most layers. In contrast, the proposed MoReS method demonstrates significantly improved visual representation utilization, particularly in the middle and lower layers, addressing the intrinsic modality imbalance in MLLMs. Right: Average visual attention score distribution versus model size for different MLLM fine-tuning methods. The plot suggests that methods achieving a better intrinsic modality balance tend to require fewer trainable parameters.
  • Figure 2: Layer-wise Modality Attention Ratio (LMAR) comparison across training methods, including Full fine-tuning, LoRA, Adapter, IA3, and our MoReS. Our MoReS method (red line) consistently demonstrates the highest LMAR across most layers, with a notable spike in the final layers. Compared with full fine-tuning and mainstream PEFT methods, MoReS requires the fewest trainable parameters during visual instruction tuning while achieving superior modality balance.
  • Figure 3: Schematic Overview of Modality Linear Representation-Steering (MoReS): Left: The architectural diagram depicts the integration of textual and visual tokens through transformer layers, leading to output token generation. Right: The mathematical formulation of MoReS illustrates the steering of visual representations within a subspace, highlighting its impact on output generation. During visual instruction tuning, the parameters of the LLM remain frozen, allowing only the parameters associated with the linear transformation in the steering mechanism to be trainable. With MoReS, the distribution of attention scores becomes more balanced, achieving intrinsic modality balance.
  • Figure 4: Comparison of parameter count vs. performance for MoReS and other PEFT methods across four benchmarks.
  • Figure 5: Architectural overview of the proposed LLaVA Steering Factory: A Modular Codebase for MLLMs.
  • ...and 3 more figures
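
The per-layer comparison of attention on visual versus other tokens described in the Figure 1 caption can be approximated with a small diagnostic. This is a minimal sketch, not the paper's LMAR definition (which is not given in this excerpt): it assumes softmax-normalized attention weights of shape (layers, heads, queries, keys) and reports the share of attention mass that lands on visual key positions. The function name `visual_attention_share` and the array layout are assumptions.

```python
import numpy as np

def visual_attention_share(attn, visual_mask):
    """Share of attention mass placed on visual tokens, per layer (assumed metric).

    attn:        (L, H, Tq, Tk) attention weights, each row summing to 1 over Tk
    visual_mask: (Tk,) boolean, True where the key position is a visual token
    returns:     (L,) mean share over heads and query positions
    """
    visual_mass = attn[..., visual_mask].sum(axis=-1)   # (L, H, Tq)
    return visual_mass.mean(axis=(1, 2))                # (L,)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    L, H, Tq, Tk = 4, 2, 3, 10
    logits = rng.normal(size=(L, H, Tq, Tk))
    attn = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)
    mask = np.zeros(Tk, dtype=bool)
    mask[:4] = True                                     # first 4 keys are visual tokens
    print(visual_attention_share(attn, mask))
```

A value near the fraction of visual tokens in the sequence would indicate roughly proportional attention; values well below it would correspond to the text-dominated pattern the caption attributes to full fine-tuning and LoRA.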