A Comprehensive Information-Decomposition Analysis of Large Vision-Language Models

Lixin Xiu, Xufang Luo, Hideki Nakayama

Abstract

Large vision-language models (LVLMs) achieve impressive performance, yet their internal decision-making processes remain opaque, making it difficult to determine whether their success stems from true multimodal fusion or from reliance on unimodal priors. To address this attribution gap, we introduce a novel framework using partial information decomposition (PID) to quantitatively measure the "information spectrum" of LVLMs -- decomposing a model's decision-relevant information into redundant, unique, and synergistic components. By adapting a scalable estimator to modern LVLM outputs, our model-agnostic pipeline profiles 26 LVLMs on four datasets across three dimensions -- breadth (cross-model and cross-task comparison), depth (layer-wise information dynamics), and time (learning dynamics across training). Our analysis reveals two key results: (i) two task regimes (synergy-driven vs. knowledge-driven) and (ii) two stable, contrasting family-level strategies (fusion-centric vs. language-centric). We also uncover a consistent three-phase pattern in layer-wise processing and identify visual instruction tuning as the key stage where fusion is learned. Together, these contributions provide a quantitative lens beyond accuracy-only evaluation and offer insights for analyzing and designing the next generation of LVLMs. Code and data are available at https://github.com/RiiShin/pid-lvlm-analysis .
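For readers unfamiliar with PID, the decomposition referenced in the abstract follows the standard partial-information-decomposition identity; the notation below is a sketch chosen to match the figure labels ($S$ for synergy, $U_2$ for language uniqueness), not the paper's exact formulation:

```latex
% Standard PID identity: the mutual information between the target Y
% and the pair of sources (image features X_1, text features X_2)
% splits into four non-negative components.
\[
  I(X_1, X_2; Y) \;=\; R \;+\; U_1 \;+\; U_2 \;+\; S
\]
% R   : redundant information carried by either modality alone
% U_1 : information unique to the image input
% U_2 : information unique to the text input
% S   : synergistic information, available only from both jointly
```

Under this reading, a high $S$ share indicates genuine multimodal fusion, while a high $U_2$ share indicates reliance on language priors.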

Paper Structure

This paper contains 54 sections, 10 equations, 15 figures, and 6 tables.

Figures (15)

  • Figure 1: Overview of this research. The first part shows the framework of PID estimation for LVLMs. Given an image-text pair, we extract image and text embeddings as two features, run a standard multimodal forward pass, and collect two unimodal predictions by masking the other modality. PID values are estimated with the BATCH estimator. The second part shows the three analysis dimensions: (1) cross-model and cross-task comparison, (2) layer-wise information dynamics, and (3) learning dynamics over training. To our knowledge, this is the first comprehensive LVLM analysis through the lens of information decomposition.
  • Figure 2: Share of synergy $S$ and language uniqueness $U_2$ across four datasets.
  • Figure 3: Family-level strategies: median $S$ versus median $U_2$ per family, computed across model sizes within each task regime. Points show the family medians for each regime. Outliers (InstructBLIP, Fuyu) are omitted for clarity.
  • Figure 4: Layer-wise PID dynamics for representative models on synergy-driven (MMBench, top) and knowledge-driven (PMC-VQA, bottom) tasks. A consistent three-phase pattern appears across models and datasets.
  • Figure 5: Evolution of $S$ and $U_2$ during two-stage training of LLaVA-1.5 (7B, 13B).
  • ...and 10 more figures