Unleashing the Intrinsic Visual Representation Capability of Multimodal Large Language Models

Hengzhuang Li, Xinsong Zhang, Qiming Peng, Bin Luo, Han Hu, Dengyang Jiang, Han-Jia Ye, Teng Zhang, Hai Jin

TL;DR

The paper tackles modality imbalance in Multimodal LLMs, where visual information is often underutilized due to next-token training. It introduces LaVer, a training framework that performs masked image modeling in the latent space of an LLM using a student-teacher EMA setup and a novel Clipped Gram-Anchoring regularizer to prevent visual feature collapse. By providing direct visual supervision and improved spatial awareness, LaVer yields discriminative latent visual representations and significantly boosts dense-visual benchmarks (notably OCR and vision-centric tasks) while preserving language capabilities. The approach is architecture-agnostic, scalable across model and data sizes, and complementary to other visual enhancement methods, marking a practical path toward truly integrated vision-language understanding.

Abstract

Multimodal Large Language Models (MLLMs) have demonstrated remarkable proficiency in multimodal tasks. Despite their impressive performance, MLLMs suffer from the modality imbalance issue, where visual information is often underutilized compared to textual representations in deeper layers, leading to degraded visual performance or hallucinations. This issue stems from the predominant reliance on next-text-token-prediction during training, which fails to provide direct visual supervisory signals, resulting in progressive homogenization of visual representations throughout the layers. To this end, we propose Latent Visual Reconstruction (LaVer), a novel training framework that facilitates MLLMs in learning more discriminative visual representations via masked image modeling in the joint latent semantic space of LLM. Our method offers direct visual activation to MLLMs, which exhibit increased visual attention allocation, indicating enhanced utilization of visual information. Extensive experiments across diverse benchmarks prove the superiority of our approach in various scenarios, especially those requiring dense visual capabilities. Code of LaVer is available at https://github.com/Fir-lat/LaVer.
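The student-teacher idea behind masked image modeling in latent space can be illustrated with a small NumPy sketch. It is not the authors' implementation: the cosine-distance loss, the momentum value, and the names `ema_update` and `masked_latent_loss` are assumptions made for illustration, and the Clipped Gram-Anchoring regularizer is left out because this summary does not define it.

```python
import numpy as np


def ema_update(teacher, student, momentum=0.99):
    """Return teacher parameters moved toward the student by an exponential moving average.

    teacher, student: dicts mapping parameter names to arrays of matching shape.
    momentum: assumed value; the source does not state the one used by LaVer.
    """
    return {
        name: momentum * teacher[name] + (1.0 - momentum) * student[name]
        for name in teacher
    }


def masked_latent_loss(student_out, teacher_out, mask, eps=1e-8):
    """Mean cosine distance between student and teacher embeddings at masked positions.

    student_out, teacher_out: (num_tokens, dim) visual output embeddings.
    mask: (num_tokens,) boolean array, True where the student's input was masked.
    The cosine-distance form is an assumption, not the loss stated in the paper.
    """
    if not mask.any():
        return 0.0  # nothing was masked, so there is nothing to reconstruct
    s = student_out[mask]
    t = teacher_out[mask]
    s = s / (np.linalg.norm(s, axis=1, keepdims=True) + eps)
    t = t / (np.linalg.norm(t, axis=1, keepdims=True) + eps)
    return float(np.mean(1.0 - np.sum(s * t, axis=1)))
```

In a training loop under these assumptions, the student would receive the masked image, the teacher the unmasked one, the loss would be backpropagated through the student only, and `ema_update` would be called after each optimizer step.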

Paper Structure

This paper contains 22 sections, 14 equations, 11 figures, 17 tables.

Figures (11)

  • Figure 1: Benchmark Performance. LaVer consistently outperforms the baseline across diverse benchmarks, especially on dense visual tasks such as OCRB Liu_2024 and CQA masry-etal-2022-chartqa. The results are obtained with SigLIP 2 tschannen2025siglip2multilingualvisionlanguage and Qwen2.5-7B-Instruct qwen2025qwen25technicalreport.
  • Figure 2: Progressive visual representation homogenization. (a) shows that feature cosine similarities are higher in the last layer than in the middle layer. (b-c) display t-SNE visualizations of the output embeddings. (d) quantifies the average cosine similarity between vision tokens. (e) quantifies the attention score allocated to vision tokens. LaVer outputs discriminative visual representations with higher attention allocation. The quantitative results are obtained with SigLIP 2 tschannen2025siglip2multilingualvisionlanguage and Qwen2.5-7B-Instruct qwen2025qwen25technicalreport, averaged across the images from MMVP 10655378.
  • Figure 3: Overview of LaVer. (a) depicts the student-teacher framework, in which the MLLM is trained to predict the teacher's visual output embeddings at the masked positions, regularized by Clipped Gram-Anchoring to prevent feature inconsistency. (b) depicts the mixed attention mechanism. (c) depicts the 2D-RoPE mechanism. LaVer learns discriminative visual representations through self-supervised MIM.
  • Figure 4: Effects of MIM on visual feature consistency. (a) illustrates a PCA visualization of visual features with different components. (b) illustrates the average cosine similarity between vision tokens over the course of training. Our method yields the most discriminative features.
  • Figure 5: Scaling properties & Ablation studies. (a) Parameter scaling of LaVer. (b) Data scaling of LaVer. (c) Ablation on masking strategies. (d) Ablation on EMA updating strategies. LaVer scales with both parameters and data and is robust to hyperparameter choices.
  • ...and 6 more figures
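
Figures 2(d) and 4(b) report an averaged cosine similarity between vision tokens as a measure of homogenization. The exact computation is not given in this summary; a plausible reading, sketched below with the hypothetical helper `mean_pairwise_cosine`, is the mean over all distinct token pairs, excluding each token's similarity with itself.

```python
import numpy as np


def mean_pairwise_cosine(tokens, eps=1e-8):
    """Average cosine similarity over all ordered pairs of distinct tokens.

    tokens: (num_tokens, dim) array of vision-token embeddings from one layer.
    Values near 1 indicate homogenized (nearly identical) representations;
    lower values indicate more discriminative ones.
    """
    tokens = np.asarray(tokens, dtype=np.float64)
    n = tokens.shape[0]
    if n < 2:
        raise ValueError("need at least two tokens to form a pair")
    unit = tokens / (np.linalg.norm(tokens, axis=1, keepdims=True) + eps)
    sim = unit @ unit.T                      # (n, n) cosine-similarity matrix
    off_diag_sum = sim.sum() - np.trace(sim) # drop self-similarities
    return float(off_diag_sum / (n * (n - 1)))
```

Applying this to the vision-token outputs of a middle layer and of the last layer would give the kind of comparison shown in Figure 2(a), assuming the figure uses a similar pairwise average.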