Robust Multimodal Learning via Cross-Modal Proxy Tokens

Md Kaykobad Reza, Ameya Patil, Mashhour Solh, M. Salman Asif

TL;DR

This work tackles robustness to missing modalities in multimodal learning by introducing Cross-Modal Proxy Tokens (CMPTs). CMPTs are learned alongside frozen unimodal encoders using lightweight low-rank adapters and an alignment loss, so that a proxy computed from the available modality stands in for the class token of the missing one; a gating fusion mechanism then combines the modalities. Across five diverse datasets, CMPTs achieve state-of-the-art robustness to missing modalities while preserving or improving performance when all modalities are present. The approach generalizes across architectures and missing-modality scenarios, and supports parameter-efficient adaptation, making it practical for real-world multimodal tasks.
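The summary above does not spell out the form of the gating fusion, so the following is only a minimal NumPy sketch of one common choice: a sigmoid gate computed from both class tokens that mixes them element-wise. The names `gated_fusion`, `select_text_token`, `W_gate`, and `b_gate` are hypothetical and are not taken from the paper or its repository.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def select_text_token(real_text_cls, proxy_text_cls, text_available):
    """Use the real text class token where text is present, else the proxy."""
    mask = np.asarray(text_available)[:, None].astype(bool)   # (batch, 1)
    return np.where(mask, real_text_cls, proxy_text_cls)      # (batch, d)

def gated_fusion(image_cls, text_cls, W_gate, b_gate):
    """Assumed gate: g = sigmoid([image; text] W + b), output = g*image + (1-g)*text."""
    joint = np.concatenate([image_cls, text_cls], axis=-1)    # (batch, 2d)
    gate = sigmoid(joint @ W_gate + b_gate)                   # (batch, d), values in (0, 1)
    fused = gate * image_cls + (1.0 - gate) * text_cls
    return fused, gate

# Toy usage: batch of 2, feature size 4; text is missing for the second sample.
rng = np.random.default_rng(0)
d = 4
image_cls = rng.normal(size=(2, d))
real_text = rng.normal(size=(2, d))
proxy_text = rng.normal(size=(2, d))      # would come from the proxy token
text_cls = select_text_token(real_text, proxy_text, np.array([1, 0]))
W_gate = rng.normal(size=(2 * d, d)) * 0.1
b_gate = np.zeros(d)
fused, gate = gated_fusion(image_cls, text_cls, W_gate, b_gate)
```

Because the gate lies strictly between 0 and 1, each fused feature is a convex combination of the two inputs, which lets a model lean on the available modality when the other is represented only by its proxy.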

Abstract

Multimodal models often experience a significant performance drop when one or more modalities are missing during inference. To address this challenge, we propose a simple yet effective approach that enhances robustness to missing modalities while maintaining strong performance when all modalities are available. Our method introduces cross-modal proxy tokens (CMPTs), which approximate the class token of a missing modality by attending only to the tokens of the available modality without requiring explicit modality generation or auxiliary networks. To efficiently learn these approximations with minimal computational overhead, we employ low-rank adapters in frozen unimodal encoders and jointly optimize an alignment loss with a task-specific loss. Extensive experiments on five multimodal datasets show that our method outperforms state-of-the-art baselines across various missing rates while achieving competitive results in complete-modality settings. Overall, our method offers a flexible and efficient solution for robust multimodal learning. The code for this paper is available at: https://github.com/CSIPlab/Cross-Modal-Proxy-Tokens.
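To make the mechanism in the abstract concrete, here is a minimal single-layer NumPy sketch, not the authors' implementation. A learnable query attends only to the tokens of the available modality to produce a proxy for the missing modality's class token, and an alignment term is added to the task loss. The mean-squared-error form of the alignment loss, the weight `lam`, the placeholder `task_loss`, and all function names are assumptions made for illustration; in the actual method the proxy token would pass through the adapted encoder rather than one attention step.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def proxy_token(query, tokens):
    """A proxy query (d,) attends over available-modality tokens (batch, n, d)."""
    d = tokens.shape[-1]
    scores = tokens @ query / np.sqrt(d)                  # (batch, n)
    weights = softmax(scores, axis=-1)                    # each row sums to 1
    proxy = np.einsum("bn,bnd->bd", weights, tokens)      # (batch, d)
    return proxy, weights

def alignment_loss(proxy, target_cls, available):
    """Assumed MSE between proxy and real class token, averaged only over
    samples where the target modality is actually present."""
    per_sample = ((proxy - target_cls) ** 2).mean(axis=-1)   # (batch,)
    n = available.sum()
    return float((per_sample * available).sum() / max(n, 1))

# Toy batch: 4 samples, 16 image tokens of size 8; text is missing for sample 2.
rng = np.random.default_rng(0)
image_tokens = rng.normal(size=(4, 16, 8))
text_cls = rng.normal(size=(4, 8))
query = rng.normal(size=8)
text_available = np.array([1.0, 1.0, 0.0, 1.0])

proxy, weights = proxy_token(query, image_tokens)
l_align = alignment_loss(proxy, text_cls, text_available)

task_loss = 0.7    # placeholder value standing in for a task-specific loss
lam = 0.5          # hypothetical weighting between the two objectives
total_loss = task_loss + lam * l_align
```

Masking the alignment term by availability reflects the fact that a proxy can only be supervised on samples where the real class token of the other modality exists.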

Paper Structure

This paper contains 36 sections, 7 equations, 13 figures, 11 tables.

Figures (13)

  • Figure 1: (a) We introduce Cross-Modal Proxy Tokens (CMPTs), a novel approach to address missing modality challenges. CMPTs effectively learn to approximate missing modality class tokens by adapting pretrained encoders through a joint optimization of alignment and task-specific objectives. Our approach accommodates both complete and missing modalities during training and inference, thereby enhancing robustness across varying missing modality scenarios. (b) CMPTs achieve state-of-the-art performance, consistently outperforming recent baseline methods in both complete and missing modality scenarios. Following the experimental setup of lee2023map and kim2024missing, the radar plot illustrates F1-macro scores on the MM-IMDb dataset across varying modality availability.
  • Figure 2: Generalization to varying missing rates during inference. Models are trained with 100% image + 100% text and evaluated with 100% image + $x$% text following ma2022multimodal and kim2024missing. Our approach demonstrates better generalization, particularly under severe modality loss.
  • Figure 3: Effectiveness of CMPTs on the MM-IMDb dataset. All models are trained with 100% image and 100% text data, then evaluated under varying amounts of missing modalities. CMPTs achieve significant performance improvements, especially under severe missing modality scenarios.
  • Figure 4: CMPTs improve performance across most classes when modalities are missing. However, they fail to accurately predict certain classes (e.g., short/documentary) in the absence of text, likely because the image modality carries insufficient information for those classes. Classes are sorted by the amount of performance improvement.
  • Figure 5: t-SNE plots of the fused feature tokens $\mathcal{T}$ for ten classes under different modality settings. The t-SNE plots show that fused features without CMPTs (red) deviate significantly from the complete multimodal features (green). Incorporating CMPTs (blue) provides embeddings that closely align with the complete multimodal features, indicating improved semantic alignment and robustness.
  • ...and 8 more figures
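
The evaluation protocol described in the Figure 2 caption (100% image + $x$% text at inference) can be sketched as a per-sample availability mask. This is an illustrative assumption about how such a split might be drawn, not the sampling code used in the paper; `text_availability_mask` and its parameters are hypothetical names.

```python
import numpy as np

def text_availability_mask(n_samples, text_rate, seed=0):
    """Boolean mask with round(text_rate * n_samples) True entries, placed
    at random. True means the text modality is kept for that sample."""
    rng = np.random.default_rng(seed)
    n_keep = int(round(text_rate * n_samples))
    mask = np.zeros(n_samples, dtype=bool)
    keep_idx = rng.choice(n_samples, size=n_keep, replace=False)
    mask[keep_idx] = True
    return mask

# Images are always available; text is kept for 30% of 100 test samples.
image_available = np.ones(100, dtype=bool)
text_available = text_availability_mask(100, text_rate=0.3)
```

Samples where the mask is False would be the ones for which a proxy has to replace the real text class token.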