Understanding Unimodal Bias in Multimodal Deep Linear Networks

Yedi Zhang; Peter E. Latham; Andrew Saxe

TL;DR

This work investigates unimodal bias in multimodal learning by analyzing multimodal deep linear networks across early, intermediate, and late fusion schemes. By deriving gradient-descent dynamics and fixed-point structures, it quantifies the duration of the unimodal phase as a function of fusion depth $L_f$, dataset correlations, and initialization, showing that deeper fusion prolongs unimodal learning and can cause permanent bias in overparameterized regimes. The results extend to certain nonlinear networks (e.g., two-layer ReLU with linear targets) and are validated through simulations and MNIST experiments, offering practical guidance on fusion-depth choices. Overall, the paper sheds light on pathologies of joint multimodal training and provides concrete analytic tools to diagnose and mitigate unimodal bias in real-world architectures.

Abstract

Using multiple input streams simultaneously to train multimodal neural networks is intuitively advantageous but practically challenging. A key challenge is unimodal bias, where a network overly relies on one modality and ignores others during joint training. We develop a theory of unimodal bias with multimodal deep linear networks to understand how architecture and data statistics influence this bias. This is the first work to calculate the duration of the unimodal phase in learning as a function of the depth at which modalities are fused within the network, dataset statistics, and initialization. We show that the deeper the layer at which fusion occurs, the longer the unimodal phase. A long unimodal phase can lead to a generalization deficit and permanent unimodal bias in the overparameterized regime. Our results, derived for multimodal linear networks, extend to nonlinear networks in certain settings. Taken together, this work illuminates pathologies of multimodal learning under joint training, showing that late and intermediate fusion architectures can give rise to long unimodal phases and permanent unimodal bias. Our code is available at: https://yedizhang.github.io/unimodal-bias.html.
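The abstract's central claim, that later fusion lengthens the unimodal phase, can be illustrated with a small simulation. The sketch below is not the authors' code: it assumes scalar modalities, the target $y = x_A + x_B$, a single shared hidden unit for early fusion, population (rather than sampled) gradients, and illustrative values for the variances, initialization scale `eps`, learning rate, and step count. The function names `simulate` and `half_times` are hypothetical.

```python
import numpy as np


def simulate(fusion, sigma_a=2.0, sigma_b=1.0, rho=0.0,
             eps=0.01, lr=0.01, steps=2000):
    """Full-batch gradient descent on the population squared loss
    0.5 * E[(y - y_hat)^2] for the target y = x_A + x_B.

    Returns an array of shape (steps, 2): the total input-to-output
    weight on modality A and on modality B after each step.
    """
    if fusion not in ("early", "late"):
        raise ValueError("fusion must be 'early' or 'late'")
    cov = np.array([[sigma_a**2, rho * sigma_a * sigma_b],
                    [rho * sigma_a * sigma_b, sigma_b**2]])
    cov_yx = cov @ np.ones(2)          # E[y x] when y = x_A + x_B
    w1 = np.full(2, eps)               # first-layer weight per modality
    # Late fusion: each modality keeps its own second-layer weight.
    # Early fusion (assumed width 1): one second-layer weight is shared.
    w2 = np.full(2, eps) if fusion == "late" else eps
    history = np.empty((steps, 2))
    for step in range(steps):
        total = w2 * w1                # effective linear map from x to y_hat
        g = cov_yx - cov @ total       # negative loss gradient w.r.t. total
        d_w1 = w2 * g
        d_w2 = w1 * g if fusion == "late" else w1 @ g
        w1 = w1 + lr * d_w1
        w2 = w2 + lr * d_w2
        history[step] = w2 * w1
    return history


def half_times(fusion, lr=0.01, **kwargs):
    """Time (step index * lr) at which each total weight first reaches 0.5,
    half of its target value of 1. Returns 0 for a weight that never does."""
    history = simulate(fusion, lr=lr, **kwargs)
    return np.argmax(history >= 0.5, axis=0) * lr


for fusion in ("early", "late"):
    t_a, t_b = half_times(fusion)
    print(f"{fusion:5s} fusion: t_A={t_a:.2f}, t_B={t_b:.2f}, gap={t_b - t_a:.2f}")
```

Under these assumptions, the gap between the two half-times serves as a rough proxy for the length of the unimodal phase, and it should come out noticeably longer for late fusion than for early fusion on the same data statistics.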

Paper Structure

This paper contains 49 sections, 64 equations, 14 figures, 1 table.

Introduction
Related Work
Problem Setup
Multimodal Data
Multimodal Fusion Linear Network
Gradient Descent Dynamics
Two-Layer Multimodal Linear Networks
Loss Landscape
Duration of the Unimodal Phase
Mis-attribution in the Unimodal Phase
Superficial Modality Preference
Underparameterization and Overparameterization
Deep Multimodal Linear Networks
Loss Landscape
Duration of the Unimodal Phase
...and 34 more sections

Figures (14)

Figure 1: Schematic of a multimodal fusion network with total depth $L$ and fusion layer at $L_f$.
Figure 2: Effect of fusion point on learning dynamics and loss landscape. Top row: Early fusion. Bottom row: Late fusion. Both networks are trained with the same dataset. (a,d) Network schematic. (b,e) Training trajectories. (c,f) Phase portrait. Late fusion introduces two manifolds of saddles (blue and magenta crosses) into the loss landscape, causing learning trajectories to plateau near a unimodal solution. Experimental details are provided in \ref{supp:implementation}.
Figure 3: Duration of unimodal phase and amount of mis-attribution in two-layer late fusion linear networks. We consider scalar inputs ${\bm{x}}_\textrm{A},{\bm{x}}_\textrm{B} \in {\mathbb{R}}$ with input covariance matrix parameterized as ${\bm{\Sigma}} = \left[\sigma_\textrm{A}^2, \rho \sigma_\textrm{A} \sigma_\textrm{B}; \rho \sigma_\textrm{B} \sigma_\textrm{A}, \sigma_\textrm{B}^2 \right]$. The target output is generated as $y={\bm{x}}_\textrm{A}+{\bm{x}}_\textrm{B}$. (a) Loss and total weight trajectories in two-layer late fusion networks when modalities are positively correlated. (b) Same as panel a but for negative correlations. (c) Time ratio $t_\textrm{B}/t_\textrm{A}$ as in \ref{eq:timeratio-2L}. (d) Amount of mis-attribution. In panels c and d, lines are theoretical predictions; circles are simulations of two-layer late fusion linear networks; crosses are simulations of two-layer late fusion ReLU networks. Experimental details are provided in \ref{supp:implementation}.
Figure 4: Demonstration of superficial modality preference. A two-layer late fusion linear network is trained with two different datasets. (a,b) In both examples, modality A is learned first. The dotted black line marks the loss when the network visits ${\mathcal{M}}_\textrm{A}$. The dotted gray line marks the loss if the network had instead visited ${\mathcal{M}}_\textrm{B}$. (a) The prioritized modality is not the modality that contributes more to the output. (b) The prioritized modality is the modality that contributes more to the output. (c) Boundaries of which modality is prioritized and which modality contributes more to the output in terms of dataset statistics. In regions I and III, modality A is learned first. In regions I and II, modality A contributes more to the output. Thus in regions II and III (shaded red), prioritization and contribution disagree, resulting in superficial modality preference. Experimental details are provided in \ref{supp:implementation}.
Figure 5: Overparameterized and underparameterized two-layer early and late fusion linear networks. Inputs are 50-dimensional, i.e., ${\bm{x}}_\textrm{A},{\bm{x}}_\textrm{B} \in {\mathbb{R}}^{50}$. (a) Loss and generalization error trajectories of a two-layer early fusion linear network trained with 700 examples. (b) Same as panel a but with late fusion. (c) Loss and generalization error trajectories of a two-layer early fusion linear network trained with 70 examples. (d) Same as panel c but with late fusion. The dotted gray line marks the lowest generalization error that a unimodal network could achieve with the same dataset. Experimental details are given in \ref{supp:implementation}.
...and 9 more figures
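
The scalar dataset described in the Figure 3 caption can be sketched as follows. This is an illustrative reconstruction, not the authors' implementation: the caption fixes only the covariance ${\bm{\Sigma}}$ and the target $y={\bm{x}}_\textrm{A}+{\bm{x}}_\textrm{B}$, so the zero-mean Gaussian input distribution, the default parameter values, and the function name `make_dataset` are assumptions.

```python
import numpy as np


def make_dataset(n, sigma_a=1.0, sigma_b=1.0, rho=0.5, seed=0):
    """Draw n scalar input pairs (x_A, x_B) with the covariance from the
    Figure 3 caption, together with targets y = x_A + x_B."""
    cov = np.array([[sigma_a**2, rho * sigma_a * sigma_b],
                    [rho * sigma_b * sigma_a, sigma_b**2]])
    rng = np.random.default_rng(seed)
    # Assumption: zero-mean Gaussian inputs; the caption gives only the covariance.
    x = rng.multivariate_normal(np.zeros(2), cov, size=n)
    y = x[:, 0] + x[:, 1]
    return x, y, cov


# Negatively correlated modalities, as in panel b of the caption.
x, y, cov = make_dataset(1000, sigma_a=2.0, sigma_b=1.0, rho=-0.5)
print(np.cov(x, rowvar=False))  # empirical estimate of the covariance
```

Sweeping `rho` and the ratio `sigma_a / sigma_b` over such datasets is one way to probe how correlation and variance affect which modality is learned first.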
