Visual Representation Alignment for Multimodal Large Language Models

Heeji Yoon, Jaewoo Jung, Junwan Kim, Hyungyu Choi, Heeseong Shin, Sangbeom Lim, Honggyu An, Chaehyun Kim, Jisang Han, Donghyun Kim, Chanho Eom, Sunghwan Hong, Seungryong Kim

TL;DR

Multimodal LLMs trained with visual instruction tuning often lose fine-grained visual information because their supervision is text-centric. VIRAL regularizes the visual pathway by aligning internal visual representations with features from pretrained vision foundation models (VFMs), and optionally with the input encoder, via a cosine-similarity loss added to the standard LM objective. Across diverse benchmarks, VIRAL yields consistent improvements on vision-centric tasks while retaining strong multimodal performance. Ablations favor VFM-targeted alignment (notably with DINOv2) applied to mid-layer representations (e.g., layer 16). The approach also accelerates training and produces more focused attention, suggesting a practical and scalable route to better visual grounding in MLLMs without extensive retraining. Overall, VIRAL demonstrates a generalizable principle: injecting rich, vision-centric supervision from VFMs into the internal visual pathway strengthens visual understanding in multimodal models.
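As a rough illustration of the objective summarized above, the sketch below combines a language-modeling loss with a cosine-similarity alignment term over visual tokens. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the `1 - cos` form of the penalty, the `weight` parameter, and the assumption that the MLLM's visual tokens have already been projected to the VFM feature dimension are all illustrative choices that the summary does not specify.

```python
import math


def cosine_similarity(u, v, eps=1e-8):
    """Cosine similarity between two equal-length vectors (lists of floats)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # eps guards against division by zero for all-zero vectors.
    return dot / max(norm_u * norm_v, eps)


def alignment_loss(projected_tokens, vfm_tokens):
    """Mean of (1 - cosine similarity) over paired visual tokens.

    projected_tokens: hidden states of the visual tokens at a chosen layer,
        assumed to be already mapped to the VFM feature dimension.
    vfm_tokens: target features from a frozen vision foundation model.
    The 1 - cos form is an assumption; the text only says "cosine-similarity loss".
    """
    if len(projected_tokens) != len(vfm_tokens):
        raise ValueError("expected one target feature per visual token")
    penalties = [
        1.0 - cosine_similarity(p, t)
        for p, t in zip(projected_tokens, vfm_tokens)
    ]
    return sum(penalties) / len(penalties)


def total_loss(lm_loss, projected_tokens, vfm_tokens, weight=0.5):
    """LM loss plus a weighted alignment term; `weight` is a hypothetical value."""
    return lm_loss + weight * alignment_loss(projected_tokens, vfm_tokens)


# Toy example: two visual tokens, one perfectly aligned and one orthogonal.
example = total_loss(
    lm_loss=2.0,
    projected_tokens=[[1.0, 0.0], [0.0, 1.0]],
    vfm_tokens=[[1.0, 0.0], [1.0, 0.0]],
)
print(example)
```

In a real training setup the same computation would normally be expressed with tensor operations in a deep learning framework so that gradients flow through the projected tokens; plain Python lists are used here only to keep the arithmetic visible.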

Abstract

Multimodal large language models (MLLMs) trained with visual instruction tuning have achieved strong performance across diverse tasks, yet they remain limited in vision-centric tasks such as object counting or spatial reasoning. We attribute this gap to the prevailing text-only supervision paradigm, which provides only indirect guidance for the visual pathway and often leads MLLMs to discard fine-grained visual details during training. In this paper, we present VIsual Representation ALignment (VIRAL), a simple yet effective regularization strategy that aligns the internal visual representations of MLLMs with those of pre-trained vision foundation models (VFMs). By explicitly enforcing this alignment, VIRAL enables the model not only to retain critical visual details from the input vision encoder but also to incorporate additional visual knowledge from VFMs, thereby enhancing its ability to reason over complex visual inputs. Our experiments demonstrate consistent improvements across all tasks on widely adopted multimodal benchmarks. Furthermore, we conduct comprehensive ablation studies to validate the key design choices underlying our framework. We believe this simple finding opens up an important direction for the effective integration of visual information in training MLLMs.

Paper Structure

This paper contains 46 sections, 8 equations, 12 figures, 6 tables.

Figures (12)

  • Figure 1: (a) VIsual Representation ALignment (VIRAL) introduces an auxiliary regularization objective on the visual pathway, preventing MLLMs from discarding detailed attributes of the input vision encoder during training while incorporating additional visual knowledge from vision foundation models (VFMs). (b) When trained with DINOv2 (Oquab et al., 2023) as the VFM, VIRAL consistently yields more accurate visually grounded responses and achieves substantial improvements over standard baselines (Liu et al., 2023) across diverse vision encoders, including CLIP (Radford et al., 2021) and SigLIPv2 (Tschannen et al., 2025).
  • Figure 2: Re-injecting or aligning visual features improves representation alignment and performance. (a–c) Comparison of (a) baseline visual instruction tuning (Liu et al., 2023), (b) re-injecting visual features, and (c) visual representation alignment, all applied at the 16th layer. (d) Layer-wise alignment between visual tokens in MLLMs and vision encoder features, measured by CKNNA (Huh et al., 2024), with shaded regions denoting middle layers that are particularly important for visual understanding. (e) Benchmark performance corresponding to (a–c).
  • Figure 3: Illustration of VIRAL. We align the visual-pathway representations of MLLMs with strong, informative representations from VFMs to improve the visual understanding of MLLMs.
  • Figure 4: Analysis of attention. Qualitative comparison of text-to-image attention maps (left) and quantified spatial entropy of attention across layers and heads (right). Applying visual representation alignment encourages the model to attend to more contextually important content, yielding a more focused and structured attention pattern.
  • Figure 5: Qualitative comparison of baseline and VIRAL. The first column shows the input image–question pairs, and the next two present LLaVA-1.5 and VIRAL results with PCA visualizations and answers. VIRAL yields structured embeddings and correct answers on counting and spatial tasks where the baseline fails.
  • ...and 7 more figures