Visual Representation Alignment for Multimodal Large Language Models
Heeji Yoon, Jaewoo Jung, Junwan Kim, Hyungyu Choi, Heeseong Shin, Sangbeom Lim, Honggyu An, Chaehyun Kim, Jisang Han, Donghyun Kim, Chanho Eom, Sunghwan Hong, Seungryong Kim
TL;DR
Visual instruction-tuned multimodal LLMs often lose fine-grained visual information because their supervision is text-centric. VIRAL regularizes the visual pathway by aligning the model's internal visual representations with features from pretrained vision foundation models (VFMs), and optionally with those of the input vision encoder, via a cosine-similarity loss added to the standard language-modeling objective. Across diverse benchmarks, VIRAL consistently improves vision-centric tasks while retaining strong general multimodal performance. Ablations favor VFM targets (notably DINOv2) and alignment at mid-layer representations (e.g., layer 16). The approach also accelerates training and yields more focused attention, suggesting a practical and scalable way to strengthen visual grounding in MLLMs without extensive retraining. Overall, VIRAL demonstrates a generalizable principle: injecting rich, vision-centric supervision from VFMs into the internal visual pathway strengthens visual understanding in multimodal models.
Abstract
Multimodal large language models (MLLMs) trained with visual instruction tuning have achieved strong performance across diverse tasks, yet they remain limited in vision-centric tasks such as object counting or spatial reasoning. We attribute this gap to the prevailing text-only supervision paradigm, which provides only indirect guidance for the visual pathway and often leads MLLMs to discard fine-grained visual details during training. In this paper, we present VIsual Representation ALignment (VIRAL), a simple yet effective regularization strategy that aligns the internal visual representations of MLLMs with those of pre-trained vision foundation models (VFMs). By explicitly enforcing this alignment, VIRAL enables the model not only to retain critical visual details from the input vision encoder but also to supplement them with additional visual knowledge from VFMs, thereby enhancing its ability to reason over complex visual inputs. Our experiments demonstrate consistent improvements across all tasks on widely adopted multimodal benchmarks. Furthermore, we conduct comprehensive ablation studies to validate the key design choices underlying our framework. We believe this simple finding opens up an important direction for the effective integration of visual information in training MLLMs.
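The alignment objective summarized above can be illustrated with a minimal sketch. It is not the authors' implementation: the helper names (`project`, `alignment_loss`, `total_loss`), the linear projection that maps hidden states into the VFM feature space, the "one minus mean cosine similarity" form of the loss, and the weight `lam` are all assumptions made for illustration. Only the general idea, a cosine-similarity term between intermediate visual-token representations and VFM features added to the language-modeling loss, comes from the text.

```python
import math


def cosine_similarity(u, v, eps=1e-8):
    """Cosine similarity between two equal-length vectors (lists of floats)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # eps guards against division by zero for all-zero vectors.
    return dot / max(norm_u * norm_v, eps)


def project(token, weight):
    """Hypothetical linear projection; `weight` is a list of rows (out_dim x in_dim)."""
    return [sum(w * x for w, x in zip(row, token)) for row in weight]


def alignment_loss(hidden_tokens, vfm_tokens, weight):
    """One minus the mean cosine similarity over paired visual tokens.

    hidden_tokens: visual-token hidden states taken from one intermediate layer
    vfm_tokens:    target features for the same image patches from a frozen VFM
    weight:        assumed projection from the hidden size to the VFM feature size
    """
    sims = [
        cosine_similarity(project(h, weight), y)
        for h, y in zip(hidden_tokens, vfm_tokens)
    ]
    return 1.0 - sum(sims) / len(sims)


def total_loss(lm_loss, align_loss, lam=0.5):
    """Language-modeling loss plus the weighted alignment term (`lam` is an assumed value)."""
    return lm_loss + lam * align_loss


# Toy usage: two visual tokens with 2-dimensional features and an identity projection.
identity = [[1.0, 0.0], [0.0, 1.0]]
hidden = [[1.0, 0.0], [0.0, 2.0]]
targets = [[2.0, 0.0], [0.0, 1.0]]
loss = total_loss(lm_loss=2.0, align_loss=alignment_loss(hidden, targets, identity))
```

In the toy usage every projected hidden state points in the same direction as its target, so the alignment term vanishes and only the language-modeling loss remains. In a real training setup the VFM would typically be kept frozen and only the MLLM (and any projection) would receive gradients from this term, which is again an assumption rather than a detail stated above.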
