Calibrated Multimodal Representation Learning with Missing Modalities

Xiaohao Liu, Xiaobo Xia, Jiaheng Wei, Shuo Yang, Xiu Su, See-Kiong Ng, Tat-Seng Chua

TL;DR

CalMRL addresses missing modalities in multimodal representation learning by identifying anchor shift as a fundamental limitation of alignment methods that assume complete modalities, and by proposing a calibrated framework that imputes missing modalities at the representation level. It introduces a bi-step optimization that alternates between closed-form posterior inference for shared latents and updates of the generative parameters, producing oracle-like imputed representations that are concatenated with the observed modalities for final alignment. The method comes with theoretical guarantees on anchor-shift mitigation and convergence, and it demonstrates superior performance on vision-text and audio-text benchmarks under missing-modality conditions. The results show that CalMRL is robust and practical for leveraging prevalent incomplete multimodal data; the authors plan to release code and data.

Abstract

Multimodal representation learning harmonizes distinct modalities by aligning them into a unified latent space. Recent research generalizes traditional cross-modal alignment to produce enhanced multimodal synergy but requires all modalities to be present for a common instance, making it challenging to utilize prevalent datasets with missing modalities. We provide theoretical insights into this issue from an anchor shift perspective. Observed modalities are aligned with a local anchor that deviates from the optimal one obtained when all modalities are present, resulting in an inevitable shift. To address this, we propose CalMRL, a multimodal representation learning method that calibrates incomplete alignments caused by missing modalities. Specifically, CalMRL leverages the priors and the inherent connections among modalities to model the imputation of the missing ones at the representation level. To resolve the optimization dilemma, we employ a bi-step learning method with a closed-form solution for the posterior distribution of shared latents. We validate its mitigation of anchor shift and its convergence with theoretical guidance. By equipping an existing advanced method with the calibrated alignment, we offer new flexibility to absorb data with missing modalities, which was previously unattainable. Extensive experiments and comprehensive analyses demonstrate the superiority of CalMRL. Our code, model checkpoints, and raw evaluation data will be made publicly available.
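To make the idea of a closed-form posterior over shared latents and representation-level imputation concrete, here is a minimal sketch. It assumes a linear-Gaussian generative model, $\mathbf{z}^m = \mathbf{W}^m \mathbf{h} + \boldsymbol{\mu}^m + \boldsymbol{\epsilon}$ with $\mathbf{h} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$, which the abstract does not spell out; the names `posterior_shared_latent`, `impute_missing`, `W`, `mu`, and `noise_var` are hypothetical and chosen only for illustration. It covers the inference step alone; the update of the generative parameters is not shown.

```python
import numpy as np

def posterior_shared_latent(observed, W, mu, noise_var):
    """Closed-form Gaussian posterior of a shared latent h.

    Assumed model (an illustration, not taken from the source):
    z^m = W[m] @ h + mu[m] + eps, h ~ N(0, I), eps ~ N(0, noise_var * I).
    `observed` maps modality name -> representation vector.
    """
    k = next(iter(W.values())).shape[1]
    precision = np.eye(k)      # prior precision of h
    linear = np.zeros(k)
    for m, z in observed.items():
        precision += W[m].T @ W[m] / noise_var
        linear += W[m].T @ (z - mu[m]) / noise_var
    cov = np.linalg.inv(precision)
    return cov @ linear, cov   # posterior mean and covariance

def impute_missing(observed, W, mu, noise_var):
    """Impute each unobserved modality from the posterior mean of h."""
    mean, _ = posterior_shared_latent(observed, W, mu, noise_var)
    return {m: W[m] @ mean + mu[m] for m in W if m not in observed}

# Toy usage: three modalities share one latent; "audio" is missing.
rng = np.random.default_rng(0)
d, k, noise_var = 8, 3, 1e-4
W = {m: rng.normal(size=(d, k)) for m in ("video", "audio", "text")}
mu = {m: rng.normal(size=d) for m in W}
h = rng.normal(size=k)
clean = {m: W[m] @ h + mu[m] for m in W}
observed = {m: clean[m] for m in ("video", "text")}
imputed = impute_missing(observed, W, mu, noise_var)
print(np.abs(imputed["audio"] - clean["audio"]).max())
```

In this toy setting the observed modalities jointly determine the shared latent, so the imputed audio representation lands close to the one that was held out. In a full training loop, the imputed vectors would stand in for the missing modalities before the alignment loss is computed.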

Paper Structure

This paper contains 25 sections, 4 theorems, 28 equations, 7 figures, 5 tables, 1 algorithm.

Key Result

Theorem 1

Let $\mathbf{u}_1$ and $\mathbf{u}_1^\Omega$ be the leading left singular vectors of the full multimodal matrix $\mathbf{Z}$ and its observed submatrix $\mathbf{Z}^\Omega$, respectively. Define $\sigma_1 = \|\mathbf{Z}\|_2$, $\sigma_1^\Omega = \|\mathbf{Z}^\Omega\|_2$, and $\eta := \sqrt{\sum_{m \in
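The quantities named in the theorem can be computed numerically. The sketch below assumes that each column of $\mathbf{Z}$ holds one modality's representation for a single instance, so that $\mathbf{Z}^\Omega$ is a column subset; this layout is an assumption, as is the use of $1 - |\langle \mathbf{u}_1, \mathbf{u}_1^\Omega \rangle|$ as a simple misalignment measure (the theorem's own bound is truncated above and is not reproduced here). The function names are hypothetical.

```python
import numpy as np

def leading_left_singular_vector(Z):
    """Return (u_1, sigma_1): leading left singular vector and spectral norm."""
    U, s, _ = np.linalg.svd(Z, full_matrices=False)
    return U[:, 0], s[0]

def anchor_misalignment(Z, observed_cols):
    """Compare the leading direction of Z with that of its observed columns.

    Assumption: columns of Z are modality representations, so the observed
    submatrix is Z[:, observed_cols]. The returned misalignment,
    1 - |<u_1, u_1^Omega>|, is an illustrative measure only.
    """
    u_full, sigma_full = leading_left_singular_vector(Z)
    u_obs, sigma_obs = leading_left_singular_vector(Z[:, observed_cols])
    cos = abs(u_full @ u_obs)  # singular vectors are defined up to sign
    return 1.0 - min(cos, 1.0), sigma_full, sigma_obs

# Toy usage: four modalities in a 16-dimensional space, two of them observed.
rng = np.random.default_rng(0)
Z = rng.normal(size=(16, 4))
shift, sigma_full, sigma_obs = anchor_misalignment(Z, [0, 2])
print(shift, sigma_full, sigma_obs)
```

Dropping columns can never increase the spectral norm, so `sigma_obs` is at most `sigma_full`, and the misalignment is zero when every modality is observed.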

Figures (7)

  • Figure 1: Missing modalities result in distorted representation alignment. When all modalities are present, the different modalities (in green) are implicitly aligned with a virtual anchor (in red). With missing modalities, the observed ones are forced to align with a local anchor that deviates from the correct one, i.e., anchor shift.
  • Figure 2: The overall framework of CalMRL. Observed unimodal content is first encoded into the corresponding representations $\{\mathbf{z}^m\}_{m\in\Omega}$ by individual encoders $\{\phi^m\}$ in $\boldsymbol{\theta}$. Despite the missing modalities (i.e., $\mathcal{M}\setminus\Omega$), CalMRL calibrates multimodal alignment by imputing the missing modalities with generative parameters $\widehat{\boldsymbol{\theta}}$. Finally, $\mathcal{L}_{\text{rep}}$ optimizes the observed unimodal encoders to align with the calibrated direction.
  • Figure 3: MSEs between real and imputed representations. "$\rightarrow$" marks the direction of imputation; Random refers to representations drawn at random.
  • Figure 4: Comparison of anchor shift ($\Delta$) before and after calibration. The left box with a gray border shows the anchor shift with missing modalities (w/o).
  • Figure 5: The performance comparison across missing, calibrated, and full ("ideal") modalities. All the models are trained on MSR-VTT.
  • ...and 2 more figures

Theorems & Definitions (6)

  • Theorem 1: Anchor shift under incomplete modality alignment
  • Proposition 2: Missing modality imputation
  • Corollary 3: Less anchor shift with calibration
  • Corollary 4: Monotonicity for CalMRL imputation
  • proof
  • proof