Modality Invariant Multimodal Learning to Handle Missing Modalities: A Single-Branch Approach

Muhammad Saad Saeed, Shah Nawaz, Muhammad Zaigham Zaheer, Muhammad Haris Khan, Karthik Nandakumar, Muhammad Haroon Yousaf, Hassan Sajjad, Tom De Schepper, Markus Schedl

TL;DR

This paper tackles the missing-modality problem in multimodal learning by proposing SRMM, a modality-invariant single-branch network that shares weights across modalities and uses a modality-switching mechanism to learn inter-modality representations. Unlike traditional multi-branch fusion, SRMM maintains performance when modalities are incomplete and demonstrates strong results across textual-visual and audio-visual tasks. The approach achieves state-of-the-art results under complete modalities on several datasets and shows superior robustness to missing and corrupted modalities, highlighting its practical utility in real-world scenarios. The work also analyzes design choices, embedding extractors, and switching strategies, underscoring the method's efficiency and potential for broader multimodal applications.

Abstract

Multimodal networks have demonstrated remarkable performance improvements over their unimodal counterparts. Existing multimodal networks are designed in a multi-branch fashion and, owing to their reliance on fusion strategies, exhibit deteriorated performance if one or more modalities are missing. In this work, we propose a modality invariant multimodal learning method, which is less susceptible to the impact of missing modalities. It consists of a single-branch network that shares weights across multiple modalities to learn inter-modality representations, maximizing performance as well as robustness to missing modalities. Extensive experiments are performed on four challenging datasets covering textual-visual (UPMC Food-101, Hateful Memes, Ferramenta) and audio-visual (VoxCeleb1) modalities. Compared to existing state-of-the-art methods, our proposed method achieves superior performance both when all modalities are present and when modalities are missing during training or testing.
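The single-branch idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the class `SingleBranch`, the function `switch_modalities`, the random per-sample switching order, and the requirement that all modality embeddings share one dimensionality are assumptions made only for illustration. The point it shows is that one set of weights scores every modality, so a sample with a missing modality still produces a prediction.

```python
import random


class SingleBranch:
    """Hypothetical single linear layer whose weights are shared by all modalities."""

    def __init__(self, dim, n_classes, seed=0):
        rng = random.Random(seed)
        self.weights = [[rng.uniform(-0.1, 0.1) for _ in range(dim)]
                        for _ in range(n_classes)]

    def forward(self, embedding):
        # The same weights score the embedding regardless of its modality.
        return [sum(w * x for w, x in zip(row, embedding)) for row in self.weights]


def switch_modalities(sample, rng):
    """Return the available (name, embedding) pairs of one sample in random order.

    Assumed strategy: a modality stored as None is treated as missing and skipped.
    """
    available = [(name, emb) for name, emb in sample["embeddings"].items()
                 if emb is not None]
    rng.shuffle(available)
    return available


rng = random.Random(0)
branch = SingleBranch(dim=4, n_classes=3)
samples = [
    {"embeddings": {"image": [0.2, 0.1, 0.0, 0.5], "text": [0.3, 0.3, 0.1, 0.0]},
     "label": 1},
    {"embeddings": {"image": [0.9, 0.0, 0.4, 0.1], "text": None},  # text missing
     "label": 2},
]
for sample in samples:
    for name, emb in switch_modalities(sample, rng):
        logits = branch.forward(emb)
        print(name, sample["label"], [round(v, 3) for v in logits])
```

In a multi-branch design the second sample would leave one branch without input; here it simply contributes one embedding instead of two to the shared branch.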

Paper Structure

This paper contains 18 sections, 3 equations, 5 figures, and 11 tables.

Figures (5)

  • Figure 1: Illustrations of commonly used multi-branch networks. These approaches learn a joint representation with fusion mechanisms (early, late or middle) from the embeddings of modalities X and Y (feng2020deep). In contrast, our proposed modality invariant method leverages only one branch to learn similar representations.
  • Figure 2: Overall architecture of SRMM. Modality-specific pre-trained networks (vision and audio networks in the given example) extract embeddings, which are passed through a modality switching mechanism and input to our single-branch network. The network shares weights across multiple modalities and learns modality-independent representations that encode inter-modality information.
  • Figure 3: Examples of (a) complete modalities, (b) visual missing, and (c) audio missing settings.
  • Figure 4: Performance evaluation of SRMM on missing modalities over four datasets including textual-visual (UPMC Food-101, Hateful Memes, Ferramenta) and audio-visual modalities (VoxCeleb1). Dotted lines represent unimodal results. The audio modality (in the case of VoxCeleb1) or the text modality (in the case of the other three datasets) is gradually dropped from 0% to 100% by randomly eliminating it from samples in the test data.
  • Figure 5: t-SNE visualizations of the embedding space of SRMM (embeddings from the second block), ViLT and TBN on the test set of UPMC Food-101. It can be seen that SRMM not only enhances the classification boundaries when complete modalities are available at test time but also retains these boundaries when the textual modality is completely missing at test time.
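
The missing-modality protocol described for Figure 4, where one modality is randomly removed from a growing fraction of test samples, can be sketched as follows. The function name `drop_modality`, the dictionary layout of a sample, the use of `None` to mark a missing modality, and the toy data are assumptions for illustration and are not taken from the paper.

```python
import random


def drop_modality(samples, modality, rate, seed=0):
    """Return a copy of `samples` with `modality` set to None in a fraction `rate`."""
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be in [0, 1]")
    rng = random.Random(seed)
    n_drop = round(rate * len(samples))
    # Pick which sample indices lose the modality, uniformly at random.
    dropped = set(rng.sample(range(len(samples)), n_drop))
    result = []
    for i, sample in enumerate(samples):
        sample = dict(sample)  # shallow copy so the input list is left untouched
        if i in dropped:
            sample[modality] = None
        result.append(sample)
    return result


# Toy test set: ten samples, each with a one-dimensional image and text embedding.
test_set = [{"image": [float(i)], "text": [float(-i)]} for i in range(10)]
for rate in (0.0, 0.5, 1.0):
    degraded = drop_modality(test_set, "text", rate)
    missing = sum(s["text"] is None for s in degraded)
    print(f"rate={rate:.1f}: {missing}/{len(degraded)} samples lack text")
```

Sweeping `rate` from 0.0 to 1.0 and evaluating a model on each degraded copy would produce a robustness curve of the kind the figure reports.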