Law of Vision Representation in MLLMs

Shijia Yang; Bohan Zhai; Quanzeng You; Jianbo Yuan; Hongxia Yang; Chenfeng Xu

TL;DR

The paper addresses the costly, empirical process of selecting vision representations for MLLMs by proposing the Law of Vision Representation, which links performance to cross-modal Alignment $A$ and visual Correspondence $C$ via a quadratic relation. It formalizes $A$ and $C$ into measurable scores and combines them into the AC Score, demonstrating a strong $R^2$ fit (approximately $0.9406$) across diverse vision representations and benchmarks. To enable scalable selection, it introduces the AC policy, a region-based sampling method that identifies top representations with only a fraction of full LLM finetuning (Recall@3 around $87.72\%$ on average, up to $91.7\%$ for OKVQA) and yields up to a $99.7\%$ reduction in compute. These results offer a principled, data-efficient approach to vision-language alignment in MLLMs, facilitating more scalable and cost-effective development.
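As a rough illustration of what a quadratic relation between $(A, C)$ and performance with a reported $R^2$ can look like in practice, the sketch below fits an ordinary least-squares model on degree-2 polynomial features of $A$ and $C$. This is only one plausible reading of the summary above, not the authors' implementation: the function names (`quadratic_features`, `fit_quadratic`, `r_squared`), the coefficients, and all data points are hypothetical and synthetic.

```python
import numpy as np


def quadratic_features(a, c):
    """Degree-2 polynomial features of (A, C): 1, A, C, A^2, A*C, C^2."""
    a = np.asarray(a, dtype=float)
    c = np.asarray(c, dtype=float)
    return np.column_stack([np.ones_like(a), a, c, a**2, a * c, c**2])


def fit_quadratic(a, c, perf):
    """Least-squares weights mapping quadratic (A, C) features to performance."""
    X = quadratic_features(a, c)
    y = np.asarray(perf, dtype=float)
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w


def r_squared(a, c, perf, w):
    """Coefficient of determination of the fitted model on the given points."""
    y = np.asarray(perf, dtype=float)
    pred = quadratic_features(a, c) @ w
    ss_res = np.sum((y - pred) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


# Synthetic, noise-free example (NOT data from the paper): 13 made-up
# (A, C) pairs and a made-up quadratic ground truth.
rng = np.random.default_rng(0)
A = rng.uniform(0.0, 1.0, size=13)
C = rng.uniform(0.0, 1.0, size=13)
true_w = np.array([0.2, 0.5, 0.3, -0.1, 0.4, -0.2])  # hypothetical weights
perf = quadratic_features(A, C) @ true_w

w = fit_quadratic(A, C, perf)
r2 = r_squared(A, C, perf, w)
```

Because the synthetic targets are generated without noise from the same feature map, the fit here is essentially exact. On real benchmark scores the $R^2$ would fall below 1, and that is the quantity the summary reports as approximately $0.9406$.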

Abstract

We present the "Law of Vision Representation" in multimodal large language models (MLLMs). It reveals a strong correlation between the combination of cross-modal alignment, correspondence in vision representation, and MLLM performance. We quantify the two factors using the cross-modal Alignment and Correspondence score (AC score). Through extensive experiments involving thirteen different vision representation settings and evaluations across eight benchmarks, we find that the AC score is linearly correlated to model performance. By leveraging this relationship, we are able to identify and train the optimal vision representation only, which does not require finetuning the language model every time, resulting in a 99.7% reduction in computational cost.
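The selection quality quoted for the AC policy is a Recall@3 figure, which the Figure 5 caption describes as the optimal vision representation falling within the top-3 predictions. A minimal sketch of that metric follows, assuming each trial supplies predicted scores and true benchmark scores for the same set of candidates. The helper names and the toy numbers are hypothetical and are not taken from the paper.

```python
def top_k_hit(predicted_scores, true_scores, k=3):
    """Return True if the truly best candidate is among the k highest-predicted."""
    if len(predicted_scores) != len(true_scores):
        raise ValueError("score lists must have the same length")
    n = len(true_scores)
    # Index of the candidate with the best true performance.
    best = max(range(n), key=lambda i: true_scores[i])
    # Candidate indices sorted by predicted score, highest first.
    ranked = sorted(range(n), key=lambda i: predicted_scores[i], reverse=True)
    return best in ranked[:k]


def recall_at_k(trials, k=3):
    """Fraction of trials in which the true optimum lands in the top-k predictions."""
    hits = [top_k_hit(pred, true, k) for pred, true in trials]
    return sum(hits) / len(hits)


# Toy trials (made-up numbers): four candidates per trial.
trials = [
    ([0.9, 0.8, 0.1, 0.2], [0.5, 0.7, 0.1, 0.2]),  # true best is index 1: hit
    ([0.9, 0.8, 0.7, 0.1], [0.1, 0.2, 0.3, 0.9]),  # true best is index 3: miss
]
recall = recall_at_k(trials, k=3)
```

In this toy setup one of the two trials places the true optimum in the top three, so `recall` is 0.5. The paper's averaged figure would come from many such trials across benchmarks.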

Paper Structure (35 sections, 7 equations, 10 figures, 6 tables, 3 algorithms)

Introduction
Related Works
Vision Representations for MLLMs
Cross-modal Alignment
Visual Correspondence
Law of Vision Representation in MLLMs
Assumptions
Theoretical Justification
Empirical Justification
Results.
AC Policy
Problem Formulation.
Policy Fitting.
Sampling Strategy.
Results.
...and 20 more sections

Figures (10)

Figure 1: Visualization of the Law of Vision Representation in MLLMs.
Figure 2: $R^2$ values for regression models fitted on various scores.
Figure 3: Overall framework of AC policy.
Figure 4: Given a limited budget of 4 finetunings, AC policy achieves 87.72% Recall@3 in predicting the optimal vision representation.
Figure 5: Number of full training (LLM finetuning) cycles required to include the optimal vision representation within the top-3 predictions (Recall@3).
...and 5 more figures
