Understanding Multimodal LLMs Under Distribution Shifts: An Information-Theoretic Approach
Changdae Oh, Zhen Fang, Shawn Im, Xuefeng Du, Yixuan Li
TL;DR
This work introduces an information-theoretic framework for understanding multimodal LLMs (MLLMs) under distribution shifts, built on Effective Mutual Information (EMI), a principled measure of the relevance between input queries and model responses. It derives an upper bound on the EMI difference (EMID) between in-distribution (ID) and out-of-distribution (OOD) data, expressed in terms of visual and textual input divergences and output-distribution discrepancies, and connects EMI to RP scores and LLM judges. The authors validate the theory across 61 synthetic and natural shift scenarios, show strong correlations between EMI and RP, and demonstrate that the EMID upper bound can serve as a regularizer to improve robustness. The framework offers a scalable, theory-grounded way to assess and improve MLLM reliability in shift-prone real-world environments, with potential extensions to broader evaluation facets and tighter theoretical bounds.
Abstract
Multimodal large language models (MLLMs) have shown promising capabilities but struggle under distribution shifts, where the evaluation data differ from the instruction-tuning distribution. Although previous works have provided empirical evaluations, we argue that a formal framework that can characterize and quantify the risk of MLLMs is necessary to ensure their safe and reliable application in the real world. Taking an information-theoretic perspective, we propose the first theoretical framework that enables the quantification of the maximum risk of MLLMs under distribution shifts. Central to our framework is Effective Mutual Information (EMI), a principled metric that quantifies the relevance between input queries and model responses. We derive an upper bound for the EMI difference between in-distribution (ID) and out-of-distribution (OOD) data, connecting it to visual and textual distributional discrepancies. Extensive experiments on real benchmark datasets, spanning 61 shift scenarios, empirically validate our theoretical insights.
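
To give a rough feel for what "relevance between input queries and model responses" means in information-theoretic terms, the sketch below computes ordinary mutual information for a small discrete joint distribution and compares an ID-like table with an OOD-like one. This is not the paper's EMI definition or its bound, which the abstract does not spell out. It is plain mutual information, and the two joint tables and the names `mutual_information`, `id_joint`, and `ood_joint` are hypothetical values chosen for illustration only.

```python
import math


def mutual_information(joint):
    """Mutual information (in nats) of a discrete joint distribution.

    `joint` is a 2D list where joint[i][j] = P(query type i, response type j).
    This is standard mutual information, NOT the paper's EMI.
    """
    px = [sum(row) for row in joint]        # marginal over queries
    py = [sum(col) for col in zip(*joint)]  # marginal over responses
    mi = 0.0
    for i, row in enumerate(joint):
        for j, p in enumerate(row):
            if p > 0:  # terms with zero probability contribute nothing
                mi += p * math.log(p / (px[i] * py[j]))
    return mi


# Hypothetical toy tables (assumed for illustration, not taken from the paper):
# responses track queries closely on ID data and less closely on OOD data.
id_joint = [[0.45, 0.05], [0.05, 0.45]]
ood_joint = [[0.30, 0.20], [0.20, 0.30]]

# Difference in query-response dependence between the two toy settings.
gap = mutual_information(id_joint) - mutual_information(ood_joint)
print(f"ID MI = {mutual_information(id_joint):.3f} nats")
print(f"OOD MI = {mutual_information(ood_joint):.3f} nats")
print(f"gap = {gap:.3f} nats")
```

In this toy setting the gap is positive because the OOD-like table is closer to independence. The paper's framework bounds an analogous difference for EMI in terms of input and output distribution discrepancies.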
