CoVFT: Context-aware Visual Fine-tuning for Multimodal Large Language Models

Nan Zhou, Huiqun Wang, Yaoyan Zheng, Di Huang

Abstract

Multimodal large language models (MLLMs) have achieved remarkable progress in cross-modal perception and reasoning, yet a fundamental question remains unresolved: should the vision encoder be fine-tuned or frozen? Despite the success of models such as LLaVA and Qwen-VL, inconsistent design choices and heterogeneous training setups hinder a unified understanding of visual fine-tuning (VFT) in MLLMs. Through a configuration-aligned benchmark, we find that existing VFT methods fail to consistently outperform the frozen baseline across multimodal tasks. Our analysis suggests that this instability arises from visual preference conflicts, where the context-agnostic nature of vision encoders induces divergent parameter updates under diverse multimodal contexts. To address this issue, we propose the Context-aware Visual Fine-tuning (CoVFT) framework, which explicitly incorporates multimodal context into visual adaptation. By integrating a Context Vector Extraction (CVE) module and a Contextual Mixture-of-Experts (CoMoE) module, CoVFT decomposes conflicting optimization signals and enables stable, context-sensitive visual updates. Extensive experiments on 12 multimodal benchmarks demonstrate that CoVFT achieves state-of-the-art performance with superior stability. Notably, fine-tuning a 7B MLLM with CoVFT surpasses the average performance of its 13B counterpart, revealing substantial untapped potential in visual encoder optimization within MLLMs.

Paper Structure (25 sections, 11 equations, 6 figures, 12 tables)

Figures (6)

  • Figure 1: (a) Performance comparison of distinct VFT methods on multimodal tasks, showing instability relative to the Freeze baseline. (b) Illustration of the visual preference conflict phenomenon. Setup: we construct grounding and captioning tasks using identical images from the Visual Genome dataset [Visualgenome] and train two independent MLLMs that differ only in their textual queries. (c) The L2 distance between the two vision encoders steadily increases during VFT, indicating growing divergence in the learned representations, especially in the deeper layers. This aligns with the classical finding that deeper layers tend to capture information with task-specific preferences [yosinski2014transferable, long2015learning].
  • Figure 2: Illustration of the proposed context-aware visual fine-tuning (CoVFT) framework. Contextual vector extraction (CVE) generates a contextual vector $\bm{c}$ by aggregating multimodal cues through text-guided cross-attention. Contextual mixture-of-experts (CoMoE) injects $\bm{c}$ into the vision encoder via context-conditioned expert routing, enabling adaptive visual parameter updates.
  • Figure 3: (a) PCA [pca] visualization of contextual vectors extracted from the CVE module. A subset of 5,000 instruction samples is clustered via k-means [kmeans], showing clear semantic grouping aligned with distinct visual preference patterns. (b) Correlation between contextual similarity and inference similarity, where inference similarity is computed as the cosine similarity of routing weights aggregated across CoMoE layers. The strong positive trend ($r=0.76$) indicates that samples with similar contextual vectors yield similar expert activation patterns.
  • Figure 4: Analysis of data scalability. Following the two-stage LLaVA training pipeline, models are fine-tuned with varying proportions of the 665K multimodal instruction dataset during second-stage instruction tuning. We compare the baseline model, with either a frozen or a fully fine-tuned vision encoder, against our CoVFT framework under different data scales.
  • Figure 5: Cosine similarity between each gradient update and the dominant gradient direction during training under the standard instruction-tuning setting.
  • ...and 1 more figure
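
The captions above rely on two simple diagnostics: an L2 distance between two vision encoders (Figure 1c) and a cosine similarity between vectors such as routing weights or gradient updates (Figures 3b and 5). The following is a minimal sketch of both quantities, not the authors' implementation. It assumes the L2 distance is taken per layer over flattened weights, and the names `per_layer_l2`, `grounding_encoder`, and `captioning_encoder`, along with all numeric values, are hypothetical toy placeholders.

```python
import math


def l2_distance(a, b):
    """Euclidean distance between two equally sized flat vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def per_layer_l2(encoder_a, encoder_b):
    """Map each layer name to the L2 distance between the two encoders' weights.

    Assumption: each encoder is a dict of layer name -> flattened weight list.
    """
    return {name: l2_distance(encoder_a[name], encoder_b[name]) for name in encoder_a}


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors of equal length."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_u * norm_v)


# Toy weights (hypothetical): two encoders fine-tuned on the same images
# with different textual queries, diverging more in the deeper layer.
grounding_encoder = {"layer_1": [0.10, 0.20], "layer_12": [0.50, -0.30]}
captioning_encoder = {"layer_1": [0.10, 0.25], "layer_12": [0.10, 0.00]}
distances = per_layer_l2(grounding_encoder, captioning_encoder)

# Toy routing weights (hypothetical) for two samples over three experts.
routing_a = [0.7, 0.2, 0.1]
routing_b = [0.6, 0.3, 0.1]
routing_similarity = cosine_similarity(routing_a, routing_b)
```

In this toy setup the deeper layer shows the larger distance, mirroring the trend the Figure 1(c) caption describes. The same `cosine_similarity` helper could be applied to gradient updates, given some definition of the dominant gradient direction, which the caption does not specify.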