Understanding Unimodal Bias in Multimodal Deep Linear Networks
Yedi Zhang, Peter E. Latham, Andrew Saxe
TL;DR
This work investigates unimodal bias in multimodal learning by analyzing multimodal deep linear networks under early, intermediate, and late fusion. From the gradient-descent dynamics and fixed-point structure, it derives the duration of the unimodal phase as a function of the fusion depth $L_f$ (the layer at which the modalities are fused), dataset correlations, and initialization. Deeper fusion prolongs the unimodal phase and, in overparametrized regimes, can cause permanent unimodal bias. The results extend to certain nonlinear networks (e.g., two-layer ReLU networks with linear targets) and are validated with simulations and MNIST experiments, offering practical guidance on the choice of fusion depth. Overall, the paper characterizes pathologies of joint multimodal training and provides analytic tools to diagnose and mitigate unimodal bias in real-world architectures.
Abstract
Using multiple input streams simultaneously to train multimodal neural networks is intuitively advantageous but practically challenging. A key challenge is unimodal bias, where a network overly relies on one modality and ignores others during joint training. We develop a theory of unimodal bias with multimodal deep linear networks to understand how architecture and data statistics influence this bias. This is the first work to calculate the duration of the unimodal phase in learning as a function of the depth at which modalities are fused within the network, dataset statistics, and initialization. We show that the deeper the layer at which fusion occurs, the longer the unimodal phase. A long unimodal phase can lead to a generalization deficit and permanent unimodal bias in the overparametrized regime. Our results, derived for multimodal linear networks, extend to nonlinear networks in certain settings. Taken together, this work illuminates pathologies of multimodal learning under joint training, showing that late and intermediate fusion architectures can give rise to long unimodal phases and permanent unimodal bias. Our code is available at: https://yedizhang.github.io/unimodal-bias.html.
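The claim that later fusion lengthens the unimodal phase can be illustrated with a toy simulation. The sketch below is not the authors' code. It assumes two scalar, uncorrelated modalities with variances 4 and 1, a target y = x_A + x_B, two-layer linear networks with scalar weights, small initialization, and gradient descent on the population squared loss. All names and constants are hypothetical choices made for illustration. It records the step at which each modality's effective input-output weight reaches half of its target value and treats the gap between the two as a proxy for the unimodal phase.

```python
# Toy illustration (not the authors' code): two scalar modalities, two-layer
# linear networks, gradient descent on the population squared loss.

VAR_A, VAR_B = 4.0, 1.0   # assumed input variances; modalities uncorrelated
W_STAR = (1.0, 1.0)       # assumed target: y = x_A + x_B
LR, STEPS, INIT = 0.01, 2000, 1e-3


def error_signal(w_a, w_b):
    """Negative loss gradient w.r.t. the effective input-output weights."""
    return VAR_A * (W_STAR[0] - w_a), VAR_B * (W_STAR[1] - w_b)


def half_times(trajectory):
    """First step at which each effective weight reaches half its target."""
    t_a = next(t for t, (w_a, _) in enumerate(trajectory) if w_a >= 0.5 * W_STAR[0])
    t_b = next(t for t, (_, w_b) in enumerate(trajectory) if w_b >= 0.5 * W_STAR[1])
    return t_a, t_b


def train_late_fusion():
    """y_hat = a2*a1*x_A + b2*b1*x_B: separate pathways, summed at the output."""
    a1 = a2 = b1 = b2 = INIT
    traj = []
    for _ in range(STEPS):
        traj.append((a1 * a2, b1 * b2))
        e_a, e_b = error_signal(a1 * a2, b1 * b2)
        # Simultaneous update: right-hand sides use the old weights.
        a1, a2 = a1 + LR * a2 * e_a, a2 + LR * a1 * e_a
        b1, b2 = b1 + LR * b2 * e_b, b2 + LR * b1 * e_b
    return traj


def train_early_fusion():
    """y_hat = v*(u_a*x_A + u_b*x_B): inputs concatenated before a shared layer."""
    u_a = u_b = v = INIT
    traj = []
    for _ in range(STEPS):
        traj.append((v * u_a, v * u_b))
        e_a, e_b = error_signal(v * u_a, v * u_b)
        # Simultaneous update: right-hand sides use the old weights.
        u_a, u_b, v = (u_a + LR * v * e_a,
                       u_b + LR * v * e_b,
                       v + LR * (u_a * e_a + u_b * e_b))
    return traj


late_a, late_b = half_times(train_late_fusion())
early_a, early_b = half_times(train_early_fusion())
gap_late = late_b - late_a     # steps during which only modality A is learned
gap_early = early_b - early_a

print(f"late fusion:  A at step {late_a}, B at step {late_b}, gap {gap_late}")
print(f"early fusion: A at step {early_a}, B at step {early_b}, gap {gap_early}")
```

In this toy setting the late-fusion pathways decouple, so the lower-variance modality must escape its small initialization on its own. Under early fusion the shared layer has already grown by the time the first modality is learned, which shortens the gap. The sketch only conveys the qualitative effect and does not reproduce the paper's analytic expressions or experiments.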
