Modality Collapse as Mismatched Decoding: Information-Theoretic Limits of Multimodal LLMs

Jayadev Billa

TL;DR

A LoRA intervention demonstrates a fix for modality collapse: training with an emotion objective improves emotion accessibility without affecting other attributes, confirming that the training objective determines what becomes accessible.

Abstract

Multimodal LLMs can process speech and images, but they cannot hear a speaker's voice or see an object's texture. We show this is not a failure of encoding: speaker identity, emotion, and visual attributes survive through every LLM layer (3--55$\times$ above chance in linear probes), yet removing 64--71% of modality-specific variance improves decoder loss. The decoder has no learned use for these directions; their presence is noise. We formalize this as a mismatched decoder problem: a decoder trained on text can only extract information along text-aligned directions. Accessible information is bounded by the Generalized Mutual Information (GMI), with degradation scaling with distributional distance and decoder sensitivity. The bound is a property of the decoder's scoring rule, not of any particular architecture; it applies whether non-text inputs arrive through a learned projection, a discrete codebook, or no explicit adapter at all. We validate this across five models spanning speech and vision. A controlled experiment (two Prismatic VLMs differing only in encoder text-alignment) confirms the bottleneck is the decoder's scoring rule, not the encoder or projection. A LoRA intervention demonstrates the fix: training with an emotion objective improves emotion accessibility ($+$7.5%) without affecting other attributes, confirming that the training objective determines what becomes accessible.

Paper Structure

This paper contains 33 sections, 3 theorems, 12 equations, 2 figures, and 11 tables.

Key Result

Theorem 1

Under i.i.d. sampling from $P_{\!M}$ and standard measurability conditions, the maximum rate extractable by the fixed decoder $q_\psi$ is $R_{\mathrm{acc}}(P_{\!M}, q_\psi) = \mathrm{GMI}_{P_{\!M}}(q_\psi)$.
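
To make the statement concrete, the sketch below evaluates the GMI numerically for a toy discrete channel. It assumes the standard definition from the mismatched-decoding literature, $\mathrm{GMI} = \sup_{s \ge 0} \mathbb{E}\left[\log \frac{q(X,Y)^s}{\sum_{x'} P(x')\, q(x',Y)^s}\right]$, which may differ in notation from the paper's own. The channel, the "blind" decoding metric, and all function names are hypothetical illustrations, not taken from the paper, and the supremum is approximated by a grid search over $s$.

```python
import numpy as np

def mutual_information(p_x, channel):
    """I(X;Y) in nats for input distribution p_x and channel[x, y] = P(y | x)."""
    joint = p_x[:, None] * channel
    p_y = joint.sum(axis=0)
    product = p_x[:, None] * p_y[None, :]
    mask = joint > 0
    return float(np.sum(joint[mask] * np.log(joint[mask] / product[mask])))

def gmi(p_x, channel, metric, s_grid=None):
    """Grid-search approximation of the GMI (nats) for a strictly positive metric q(x, y)."""
    if s_grid is None:
        s_grid = np.linspace(0.0, 5.0, 501)
    joint = p_x[:, None] * channel
    best = 0.0  # s = 0 always yields a value of exactly 0
    for s in s_grid:
        log_qs = s * np.log(metric)
        # log of sum_{x'} P(x') q(x', y)^s, one entry per output symbol y
        log_denominator = np.log((p_x[:, None] * np.exp(log_qs)).sum(axis=0))
        value = float(np.sum(joint * (log_qs - log_denominator[None, :])))
        best = max(best, value)
    return best

# Toy setup (assumption): three inputs, a mostly noiseless channel.
p_x = np.full(3, 1.0 / 3.0)
W = np.array([[0.90, 0.05, 0.05],
              [0.05, 0.90, 0.05],
              [0.05, 0.05, 0.90]])

# A decoder metric that scores inputs 1 and 2 identically, so it cannot
# tell them apart even though the channel output distinguishes them.
q_blind = np.array([[0.90, 0.050, 0.050],
                    [0.05, 0.475, 0.475],
                    [0.05, 0.475, 0.475]])

mi = mutual_information(p_x, W)
gmi_matched = gmi(p_x, W, W)       # matched decoder: q = W
gmi_blind = gmi(p_x, W, q_blind)   # mismatched decoder

print(f"I(X;Y)        = {mi:.4f} nats")
print(f"GMI (matched) = {gmi_matched:.4f} nats")
print(f"GMI (blind)   = {gmi_blind:.4f} nats")
```

With a matched metric the GMI recovers the mutual information, while the blind metric extracts strictly less even though the channel output still carries the distinction between inputs 1 and 2. This mirrors, in miniature, the abstract's claim that information can be present yet inaccessible to a fixed scoring rule.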

Figures (2)

  • Figure 1: Probe accuracy trajectories for all five models. Rows 1--2: speech models (Ultravox, Qwen2-Audio) across LibriSpeech, CREMA-D, and ESC-50. Row 3: vision models (LLaVA, Prismatic-DINOv2, Prismatic-SigLIP) on COCO.
  • Figure 2: Mode alignment profiles for all five models. Blue: alignment score $\tilde{\alpha}(u_k)$; orange: eigenvalue spectrum $\lambda_k$ (log scale). The dominant eigenmodes are modality-specific ($\tilde{\alpha} \approx 0$) for all models with non-text-aligned encoders; for Prismatic-S (SigLIP, text-aligned), even Mode 0 is text-aligned ($\tilde{\alpha} = 0.83$).
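
For readers unfamiliar with the probe trajectories in Figure 1, the following is a minimal sketch of a linear probe evaluated at several depths. It runs on synthetic data only: the "layers", the class structure, the signal strengths, and the function name `linear_probe_accuracy` are assumptions made for illustration, and the ridge-regression probe shown here is not necessarily the probe used in the paper.

```python
import numpy as np

def linear_probe_accuracy(train_x, train_y, test_x, test_y, num_classes, ridge=1e-3):
    """Fit a ridge-regularized linear map to one-hot labels and return test accuracy."""
    def with_bias(x):
        return np.hstack([x, np.ones((x.shape[0], 1))])

    design = with_bias(train_x)
    targets = np.eye(num_classes)[train_y]
    gram = design.T @ design + ridge * np.eye(design.shape[1])
    weights = np.linalg.solve(gram, design.T @ targets)
    predictions = (with_bias(test_x) @ weights).argmax(axis=1)
    return float((predictions == test_y).mean())

rng = np.random.default_rng(0)
num_classes, dim, num_samples, num_train = 4, 32, 2000, 1500
labels = rng.integers(0, num_classes, size=num_samples)
class_means = rng.normal(size=(num_classes, dim))
chance = 1.0 / num_classes

# Synthetic stand-ins for hidden states at three depths: the attribute
# signal shrinks relative to unit-variance noise as depth increases.
accuracies = []
for signal in (2.0, 1.0, 0.5):
    states = signal * class_means[labels] + rng.normal(size=(num_samples, dim))
    accuracy = linear_probe_accuracy(
        states[:num_train], labels[:num_train],
        states[num_train:], labels[num_train:],
        num_classes,
    )
    accuracies.append(accuracy)
    print(f"signal={signal:.1f}  accuracy={accuracy:.3f}  ratio to chance={accuracy / chance:.2f}")
```

The ratio of probe accuracy to chance is the kind of quantity the abstract reports as "above chance". A high ratio shows only that the attribute is linearly recoverable from the representation, not that the decoder makes use of it.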

Theorems & Definitions (3)

  • Theorem 1: Accessible rate = GMI (following Scarlett et al., 2020)
  • Theorem 2: GMI-Wasserstein bound
  • Theorem 3: Probe penalty