Understanding Multimodal LLMs Under Distribution Shifts: An Information-Theoretic Approach

Changdae Oh, Zhen Fang, Shawn Im, Xuefeng Du, Yixuan Li

TL;DR

This work introduces an information-theoretic framework for understanding multimodal LLMs (MLLMs) under distribution shifts by defining Effective Mutual Information (EMI) as a principled measure of input–output relevance. It develops EMID, an EMI-based bound that quantifies the MLLM performance gap between in-distribution (ID) and out-of-distribution (OOD) data in terms of visual and textual input divergences and output-distribution discrepancies, and it connects EMI to relative preference (RP) scores and LLM judges. The authors validate the theory across 61 synthetic and natural shift scenarios, show strong correlations between EMI and RP, and demonstrate the practical utility of an EMID upper bound as a regularizer to improve robustness. The framework provides a scalable, theory-grounded approach to assessing and improving MLLM reliability in real-world, shift-prone environments, with potential extensions to broader evaluation facets and tighter theoretical bounds.

Abstract

Multimodal large language models (MLLMs) have shown promising capabilities but struggle under distribution shifts, where evaluation data differ from instruction tuning distributions. Although previous works have provided empirical evaluations, we argue that establishing a formal framework that can characterize and quantify the risk of MLLMs is necessary to ensure the safe and reliable application of MLLMs in the real world. By taking an information-theoretic perspective, we propose the first theoretical framework that enables the quantification of the maximum risk of MLLMs under distribution shifts. Central to our framework is the introduction of Effective Mutual Information (EMI), a principled metric that quantifies the relevance between input queries and model responses. We derive an upper bound for the EMI difference between in-distribution (ID) and out-of-distribution (OOD) data, connecting it to visual and textual distributional discrepancies. Extensive experiments on real benchmark datasets, spanning 61 shift scenarios, empirically validate our theoretical insights.

Paper Structure

This paper contains 52 sections, 15 theorems, 68 equations, 7 figures, 10 tables.

Key Result

Lemma 4.3

Given a distribution $P_{\mathbf{X}Y}$ and an MLLM $P_{\theta}$, if $\mathbb{E}_{\mathbf{x}\sim P_{\mathbf{X}}} [D_{\rm KL}(P_{\theta}(\cdot|\mathbf{x})\| P_{Y|\mathbf{X}=\mathbf{x}})] \leq \delta$ and the reward function $r(\mathbf{x},y)$ is $\log P_{Y|\mathbf{X}=\mathbf{x}}(y)$, then

Figures (7)

  • Figure 1: Performance variation against varying distribution shifts. We evaluated LLaVA v1.5 (top) and LLaVA NeXT (bottom) models on 27 out-of-distribution (OOD) variants of the LLaVA-Bench COCO (ID). Here, the $x$-axis is sorted by the severity of shifts between ID and OOD. There is a consistent trend: more severe distribution shifts result in larger performance degradation of MLLMs.
  • Figure 2: Types of distribution shifts between train and evaluation of MLLMs. We simulate visual, text, and joint shifts by controlling the shift of each input modality.
  • Figure 3: Scatter plot with regression line between empirical estimates of EMID and its upper bound. Over the 34 synthetic and 27 natural distribution shift scenarios, we evaluate four MLLMs, yielding 136 synthetic-shift cases and 108 natural-shift cases, for which we visualize EMID and its scale-adjusted upper bound estimates (see Appendix \ref{appendix:implementation_details} for details). The two panels on the left show results for all four models, whereas the right ones distinguish them per model with fitted linear regression coefficients (Slope).
  • Figure 4: Scatter plot with regression line between empirical estimates of EMID and partial components of its upper bound. We remove the $\Delta$ term of the bound (Eq. \ref{eq:emid_bound_simple}) and only use the estimates of the JSD terms over visual and text inputs.
  • Figure 5: Information diagram and motivation of effective mutual information. The difference between vanilla MI terms does not consider the domain-dependent intrinsic scale and mutual information, thereby failing to fairly measure the relevance between input query $x$ and model prediction $\hat{y}$. Meanwhile, EMI ablates the domain-dependent characteristic to focus on measuring effective relevance between $x$ and $\hat{y}$.
  • ...and 2 more figures
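
The captions above refer to mutual information (MI) and to Jensen-Shannon divergence (JSD) terms over the visual and text inputs. As a minimal sketch of what these two quantities measure, the snippet below computes them for small discrete distributions. It is illustrative only: the function names are hypothetical, and the paper's estimators for high-dimensional image and text inputs are not shown in this summary and are not reproduced here.

```python
import numpy as np


def kl_divergence(p, q):
    """KL(p || q) in nats for discrete distributions; terms with p == 0 contribute 0."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))


def js_divergence(p, q):
    """Jensen-Shannon divergence: average KL of p and q to their mixture m."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    p = p / p.sum()  # normalize so that raw counts are also accepted
    q = q / q.sum()
    m = 0.5 * (p + q)
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def mutual_information(joint):
    """MI of a discrete joint table, computed as KL(joint || product of marginals)."""
    joint = np.asarray(joint, dtype=float)
    joint = joint / joint.sum()
    px = joint.sum(axis=1, keepdims=True)  # marginal over rows
    py = joint.sum(axis=0, keepdims=True)  # marginal over columns
    return kl_divergence(joint.ravel(), (px * py).ravel())


if __name__ == "__main__":
    # Toy "ID" and "OOD" histograms over three hypothetical input categories.
    p_id = [0.7, 0.2, 0.1]
    p_ood = [0.2, 0.3, 0.5]
    print("JSD(ID, OOD):", js_divergence(p_id, p_ood))

    # Toy joint table over (input, response) with perfect dependence.
    print("MI:", mutual_information([[0.5, 0.0], [0.0, 0.5]]))
```

With natural logarithms, the JSD lies between 0 (identical distributions) and $\log 2$ (disjoint supports), which makes it a convenient bounded measure of how far OOD inputs drift from ID inputs.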

Theorems & Definitions (28)

  • Definition 2.1: Relative Preference Score
  • Definition 4.1: Mutual Information (MI)
  • Definition 4.2: Effective Mutual Information (EMI)
  • Lemma 4.3
  • Theorem 4.4
  • Theorem 4.5: Simplified Scenario
  • Theorem 4.6: General Scenario
  • Lemma 4.1: Restatement of Lemma \ref{Main-thm1-lemma}
  • proof
  • Theorem 4.2: Restatement of Lemma \ref{Main-thm1-thm}
  • ...and 18 more
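
For orientation, the textbook form of mutual information, which Definition 4.1 refers to by name, can be written as follows. This is the standard definition for a joint distribution $P_{\mathbf{X}Y}$ with marginals $P_{\mathbf{X}}$ and $P_{Y}$; the paper's exact notation is not shown in this summary and may differ.

$$
I(P_{\mathbf{X}Y}) = \mathbb{E}_{(\mathbf{x},y)\sim P_{\mathbf{X}Y}}\left[\log \frac{P_{\mathbf{X}Y}(\mathbf{x},y)}{P_{\mathbf{X}}(\mathbf{x})\,P_{Y}(y)}\right] = D_{\rm KL}\left(P_{\mathbf{X}Y}\,\|\,P_{\mathbf{X}}\otimes P_{Y}\right)
$$

The quantity is zero exactly when $\mathbf{X}$ and $Y$ are independent, and it grows as the response $y$ becomes more predictable from the input $\mathbf{x}$.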