How Vision Becomes Language: A Layer-wise Information-Theoretic Analysis of Multimodal Reasoning

Hongxuan Wu, Yukun Zhang, Xueqing Zhou

TL;DR

This paper gives an information-theoretic, causal account of how vision becomes language in multimodal Transformers and offers quantitative guidance for identifying architectural bottlenecks where modality-specific information is lost.

Abstract

When a multimodal Transformer answers a visual question, is the prediction driven by visual evidence, linguistic reasoning, or genuinely fused cross-modal computation -- and how does this structure evolve across layers? We address this question with a layer-wise framework based on Partial Information Decomposition (PID) that decomposes the predictive information at each Transformer layer into redundant, vision-unique, language-unique, and synergistic components. To make PID tractable for high-dimensional neural representations, we introduce \emph{PID Flow}, a pipeline combining dimensionality reduction, normalizing-flow Gaussianization, and closed-form Gaussian PID estimation. Applying this framework to LLaVA-1.5-7B and LLaVA-1.6-7B across six GQA reasoning tasks, we uncover a consistent \emph{modal transduction} pattern: visual-unique information peaks early and decays with depth, language-unique information surges in late layers to account for roughly 82\% of the final prediction, and cross-modal synergy remains below 2\%. This trajectory is highly stable across model variants (layer-wise correlations $>$0.96) yet strongly task-dependent, with semantic redundancy governing the detailed information fingerprint. To establish causality, we perform targeted Image$\rightarrow$Question attention knockouts and show that disrupting the primary transduction pathway induces predictable increases in trapped visual-unique information, compensatory synergy, and total information cost -- effects that are strongest in vision-dependent tasks and weakest in high-redundancy tasks. Together, these results provide an information-theoretic, causal account of how vision becomes language in multimodal Transformers, and offer quantitative guidance for identifying architectural bottlenecks where modality-specific information is lost.
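
To make the four-way decomposition concrete, here is a minimal sketch of a closed-form Gaussian PID. It assumes the minimum-mutual-information (MMI) redundancy, $R = \min(I(V;Y), I(L;Y))$, which is one common choice for Gaussian variables; the abstract does not say which redundancy measure PID Flow uses, so this is an illustration and not the authors' estimator. The function names and the toy covariance are hypothetical.

```python
import numpy as np

def gaussian_mi(cov, idx_a, idx_b):
    """Mutual information (in nats) between two blocks of a jointly Gaussian vector."""
    cov = np.asarray(cov, dtype=float)
    a, b = np.asarray(idx_a), np.asarray(idx_b)
    ab = np.concatenate([a, b])
    # I(A;B) = 0.5 * (log|Sigma_A| + log|Sigma_B| - log|Sigma_AB|)
    _, logdet_a = np.linalg.slogdet(cov[np.ix_(a, a)])
    _, logdet_b = np.linalg.slogdet(cov[np.ix_(b, b)])
    _, logdet_ab = np.linalg.slogdet(cov[np.ix_(ab, ab)])
    return 0.5 * (logdet_a + logdet_b - logdet_ab)

def mmi_pid(cov, idx_v, idx_l, idx_y):
    """PID of I(V,L;Y) under the MMI redundancy (an assumption, see lead-in)."""
    i_v = gaussian_mi(cov, idx_v, idx_y)
    i_l = gaussian_mi(cov, idx_l, idx_y)
    i_vl = gaussian_mi(cov, np.concatenate([idx_v, idx_l]), idx_y)
    r = min(i_v, i_l)                # redundant information
    u_v = i_v - r                    # vision-unique
    u_l = i_l - r                    # language-unique
    s = i_vl - r - u_v - u_l         # synergy closes the total
    return {"R": r, "U_V": u_v, "U_L": u_l, "S": s, "I_tot": i_vl}

# Toy example: Y = V + 0.5 * L + noise, with V, L independent unit Gaussians
# and noise variance 0.25. Variable order in the covariance is (V, L, Y).
cov = np.array([[1.0, 0.0, 1.0],
                [0.0, 1.0, 0.5],
                [1.0, 0.5, 1.5]])
parts = mmi_pid(cov, idx_v=[0], idx_l=[1], idx_y=[2])
fractions = {k: v / parts["I_tot"] for k, v in parts.items() if k != "I_tot"}
```

The four components sum to the total predictive information by construction, so per-layer fractions such as the reported language-unique share can be read off as `U_L / I_tot`.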

Paper Structure (95 sections, 3 theorems, 21 equations, 9 figures, 9 tables)

Key Result

Proposition 4.4

Under non-degenerate conditions ($I_{\mathrm{tot}}^{(L)}>0$ and at least one PID component strictly dominates), the three mechanism definitions (Definitions 4.1 to 4.3: persistent synergy, modal transduction, and redundancy-dominant convergence) describe mutually exclusive regimes in the space of final-layer allocations.

Figures (9)

  • Figure 1: Core research framework. (A) Multimodal input: image and question tokens are processed by a 32-layer LLaVA Transformer. (B) PID Flow estimation: layer-wise representations are compressed via mean pooling and PCA, Gaussianized by a normalizing flow, and decomposed into four PID components ($R, U_V, U_L, S$). (C) Information trajectories reveal a three-stage modal transduction pattern: visual injection (Stage I), consolidation (Stage II), and language-dominant decision formation (Stage III), with $U_L \approx 82\%$ and $S < 2\%$ at the final layer. Cross-model trajectory correlations satisfy $r > 0.96$.
  • Figure 2: Layer-wise PID trajectories for LLaVA-1.5-7B across six tasks. Each panel plots redundancy ($R^\ell$), vision-unique ($U_V^\ell$), language-unique ($U_L^\ell$), and synergy ($S^\ell$) as functions of Transformer depth. All tasks share a consistent pattern: $U_V$ peaks early and decays, $U_L$ surges in late layers, and $S$ remains bounded throughout---the signature of modal transduction.
  • Figure 3: Layer-wise PID trajectories for LLaVA-1.5-7B across six tasks. All tasks exhibit decay of $U_V$, a late-stage surge of $U_L$, and bounded synergy $S$, supporting modal transduction as the dominant mechanism.
  • Figure 4: Turning layers and final-layer composition. Transition points ($U_V$ peak, $U_L$ trough, $U_L$ surge onset) cluster across tasks, and final-layer predictive information is dominated by $U_L$.
  • Figure 5: Layer-wise PID trajectories for LLaVA-1.6-7B across six tasks. The depth-wise pattern mirrors that of LLaVA-1.5-7B: $U_V$ decays, $U_L$ surges late, and synergy remains bounded, confirming modal transduction in an architecturally distinct model.
  • ...and 4 more figures
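
The compression and Gaussianization steps named in panel (B) of Figure 1 can be sketched roughly as follows. This is a hedged illustration only: mean pooling and PCA follow the caption, but the normalizing flow is replaced here by a much simpler rank-based marginal Gaussianization, which only makes each coordinate Gaussian and is not a substitute for a trained flow. All function names, array shapes, and the choice of 8 components are assumptions.

```python
import numpy as np
from statistics import NormalDist

def mean_pool(hidden, token_mask):
    """Average hidden states over selected tokens.

    hidden: (n_samples, n_tokens, d) array of layer activations.
    token_mask: (n_samples, n_tokens) boolean array, True for tokens to keep
    (for example, image tokens or question tokens).
    """
    mask = token_mask[..., None].astype(hidden.dtype)
    return (hidden * mask).sum(axis=1) / np.maximum(mask.sum(axis=1), 1.0)

def pca_project(x, k):
    """Project centered rows of x onto the top-k principal directions."""
    xc = x - x.mean(axis=0, keepdims=True)
    _, _, vt = np.linalg.svd(xc, full_matrices=False)
    return xc @ vt[:k].T

def rank_gaussianize(x):
    """Map each column to standard-normal quantiles via its ranks.

    Stand-in for the normalizing flow: it Gaussianizes marginals only.
    """
    n = x.shape[0]
    ranks = x.argsort(axis=0).argsort(axis=0) + 1   # ranks 1..n per column
    u = ranks / (n + 1.0)                           # strictly inside (0, 1)
    return np.vectorize(NormalDist().inv_cdf)(u)

# Synthetic stand-in for one layer's activations (not real model data).
rng = np.random.default_rng(0)
hidden = rng.standard_normal((200, 12, 64))
mask = np.zeros((200, 12), dtype=bool)
mask[:, :8] = True                                  # pretend the first 8 tokens are image tokens
pooled = mean_pool(hidden, mask)
z = rank_gaussianize(pca_project(pooled, k=8))
```

The output `z` would then feed a Gaussian PID estimator together with the matching representation of the other modality and the target.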

Theorems & Definitions (6)

  • Definition 4.1: Persistent synergy
  • Definition 4.2: Modal transduction
  • Definition 4.3: Redundancy-dominant convergence
  • Proposition 4.4: Separability of mechanisms
  • Theorem 4.5: Bijection invariance of mutual information
  • Proposition 4.6: Consistency (informal)
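
For reference, the standard form of the invariance named in Theorem 4.5 is sketched below. The paper's exact hypotheses are not reproduced on this page, so the conditions on $f$ and $g$ are assumptions taken from the textbook statement.

```latex
% Assumed conditions: f and g are bijective and measurable with measurable
% inverses, and each acts on its own variable only.
I\bigl(f(X);\, g(Y)\bigr) \;=\; I(X;\, Y)
```

This property is presumably what lets a Gaussianizing flow be applied to each representation separately without changing the mutual information terms that the PID is built from.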