Predictive Regularization Against Visual Representation Degradation in Multimodal Large Language Models

Enguang Wang, Qiang Wang, Yuanchen Wu, Ke Yan, Xinbin Yuan, Shouhong Ding, Xialei Liu, Ming-Ming Cheng

Abstract

While Multimodal Large Language Models (MLLMs) excel at vision-language tasks, the cost that their language-driven training imposes on their internal foundational visual competence remains unclear. In this paper, we conduct a detailed diagnostic analysis to unveil a pervasive issue: visual representation degradation in MLLMs. Specifically, we find that, compared to the initial visual features, the visual representations in the middle layers of the LLM exhibit degradation in both global function and patch structure. We attribute this phenomenon to a visual sacrifice driven by the singular text-generation objective, where the model compromises its visual fidelity to optimize for answer generation. We argue that a robust MLLM requires both strong cross-modal reasoning and core visual competence, and propose Predictive Regularization (PRe) to force degraded intermediate features to predict initial visual features, thereby maintaining the inherent visual attributes of the MLLM's internal representations. Extensive experiments confirm that mitigating this visual degradation effectively boosts vision-language performance, underscoring the critical importance of fostering robust internal visual representations within MLLMs for comprehensive multimodal understanding.
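
To make the PRe idea in the abstract concrete, the following is a minimal NumPy sketch of a predictive regularizer, not the authors' implementation. It assumes a hypothetical two-layer MLP as the "lightweight prediction head", a mean (1 - cosine similarity) objective, and a hypothetical weight `lambda_pre`; the abstract specifies none of these, and the random arrays merely stand in for real features.

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    """Scale each row to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def prediction_head(h, w1, b1, w2, b2):
    """Hypothetical two-layer MLP head (an assumption, not the paper's design)."""
    hidden = np.maximum(h @ w1 + b1, 0.0)  # ReLU non-linearity
    return hidden @ w2 + b2

def pre_loss(intermediate, anchor, params):
    """Mean (1 - cosine similarity) between predicted and initial patch features.

    The cosine objective is an assumed choice; other regression losses
    would fit the same description.
    """
    pred = prediction_head(intermediate, *params)
    cos = np.sum(l2_normalize(pred) * l2_normalize(anchor), axis=-1)
    return float(np.mean(1.0 - cos))

# Toy shapes: 16 patches, LLM width 32, visual feature width 24 (all illustrative).
rng = np.random.default_rng(0)
num_patches, d_llm, d_hidden, d_vis = 16, 32, 64, 24
anchor = rng.normal(size=(num_patches, d_vis))        # stand-in for initial visual features
intermediate = rng.normal(size=(num_patches, d_llm))  # stand-in for a middle-layer state
params = (
    rng.normal(scale=0.1, size=(d_llm, d_hidden)), np.zeros(d_hidden),
    rng.normal(scale=0.1, size=(d_hidden, d_vis)), np.zeros(d_vis),
)

reg = pre_loss(intermediate, anchor, params)
lm_loss = 2.0      # placeholder for the text-generation loss
lambda_pre = 0.5   # hypothetical regularization weight
total_loss = lm_loss + lambda_pre * reg
```

In an actual training setup one would presumably treat the anchor features as a fixed target and backpropagate only through the intermediate features and the head, but that detail is likewise an assumption here.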

Paper Structure

This paper contains 14 sections, 7 equations, 13 figures, 8 tables, 1 algorithm.

Figures (13)

  • Figure 1: The proposed PRe framework. PRe forces the degraded representations to predict their initial, clean anchor representations via a lightweight prediction head.
  • Figure 2: Linear probe results on global visual features. Relative to the initial representation (Layer 0), a consistent performance drop is observed in the intermediate layers.
  • Figure 3: Evolution of patch-level semantics. The intra- and inter-object similarities rise through most layers, reducing the semantic contrast ratio and revealing patch structure degradation.
  • Figure 4: The cosine similarity between the patch marked in green and all other patches. Compared with the initial representation, the middle layer blurs the visual semantic boundaries. (Images are from the open-source COCO-Stuff dataset, Caesar et al., 2018.)
  • Figure 5: The statistical properties of the global representations. The increased PCA effective dimension and reduced mean off-diagonal correlation indicate that the representations become both more geometrically complex and statistically independent.
  • ...and 8 more figures
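
The diagnostics named in the captions of Figures 3 and 5 can be sketched as follows. This is an illustrative NumPy version under stated assumptions, not the paper's exact definitions: the PCA effective dimension is taken to be the participation ratio of the covariance eigenvalues, the mean off-diagonal correlation is the mean absolute off-diagonal entry of the feature correlation matrix, and the semantic contrast ratio is the mean intra-object cosine similarity divided by the mean inter-object similarity. All function names and the toy data are hypothetical.

```python
import numpy as np

def effective_dimension(features):
    """Participation ratio of PCA eigenvalues: (sum l)^2 / sum l^2 (assumed definition)."""
    centered = features - features.mean(axis=0, keepdims=True)
    cov = centered.T @ centered / (features.shape[0] - 1)
    eigvals = np.clip(np.linalg.eigvalsh(cov), 0.0, None)  # guard against tiny negatives
    return float(eigvals.sum() ** 2 / np.sum(eigvals ** 2))

def mean_offdiag_correlation(features):
    """Mean absolute correlation between distinct feature dimensions."""
    corr = np.corrcoef(features, rowvar=False)
    mask = ~np.eye(corr.shape[0], dtype=bool)
    return float(np.mean(np.abs(corr[mask])))

def semantic_contrast_ratio(patches, labels):
    """Mean intra-object cosine similarity over mean inter-object similarity.

    Assumes the inter-object mean is positive, as in the toy data below.
    """
    unit = patches / np.linalg.norm(patches, axis=1, keepdims=True)
    sim = unit @ unit.T
    same = labels[:, None] == labels[None, :]
    off_diag = ~np.eye(len(labels), dtype=bool)
    intra = sim[same & off_diag].mean()   # same object, excluding self-similarity
    inter = sim[~same].mean()             # different objects
    return float(intra / inter)

# Global statistics: independent dimensions versus one shared factor.
rng = np.random.default_rng(0)
isotropic = rng.normal(size=(2000, 8))
correlated = rng.normal(size=(2000, 1)) + 0.1 * rng.normal(size=(2000, 8))
dim_iso, dim_cor = effective_dimension(isotropic), effective_dimension(correlated)
corr_iso, corr_cor = mean_offdiag_correlation(isotropic), mean_offdiag_correlation(correlated)

# Patch structure: two objects, then the same patches pulled toward the global mean.
labels = np.array([0, 0, 0, 1, 1, 1])
prototypes = np.array([[1.0, 0.5, 0.0, 0.0], [0.5, 1.0, 0.0, 0.0]])
sharp = prototypes[labels]
blurred = 0.5 * sharp + 0.5 * sharp.mean(axis=0, keepdims=True)
ratio_sharp = semantic_contrast_ratio(sharp, labels)
ratio_blurred = semantic_contrast_ratio(blurred, labels)
```

In this toy setting, the isotropic features have a higher effective dimension and lower off-diagonal correlation than the single-factor features, and pulling patches toward the global mean raises the inter-object similarity and so lowers the contrast ratio, mirroring the trends the captions describe.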