Chameleon: Images Are What You Need For Multimodal Learning Robust To Missing Modalities

Muhammad Irzam Liaqat, Shah Nawaz, Muhammad Zaigham Zaheer, Muhammad Saad Saeed, Hassan Sajjad, Tom De Schepper, Karthik Nandakumar, Muhammad Haris Khan, Markus Schedl

TL;DR

This work tackles robustness in textual-visual multimodal learning under missing modalities. It introduces Chameleon, which unifies textual and visual inputs by encoding word embeddings as color-coded pixels, enabling a single visual network to learn joint representations. Two training schemes, joint and fused, allow the model to leverage available modalities even when one is missing, achieving SOTA performance on several datasets with complete modalities and strong robustness when modalities are absent. The approach is backbone- and dataset-agnostic, and analyses show reliable behavior across CNNs and ViTs, offering a practical path for robust multimodal systems in real-world incomplete data settings.

Abstract

Multimodal learning has demonstrated remarkable performance improvements over unimodal architectures. However, multimodal learning methods often exhibit degraded performance if one or more modalities are missing. This may be attributed to the commonly used multi-branch design with modality-specific streams, which makes the models reliant on the availability of a complete set of modalities. In this work, we propose a robust textual-visual multimodal learning method, Chameleon, that completely deviates from the conventional multi-branch design. To enable this, we unify the input modalities into one format by encoding the textual modality into visual representations. As a result, our approach does not require modality-specific branches to learn modality-independent multimodal representations, which makes it robust to missing modalities. Extensive experiments are performed on four popular, challenging datasets: Hateful Memes, UPMC Food-101, MM-IMDb, and Ferramenta. Chameleon not only achieves superior performance when all modalities are present at train/test time but also demonstrates notable resilience when modalities are missing.
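
To make the idea of encoding text as an image concrete, the following is a minimal sketch, not the authors' implementation. It assumes one image row per word and one pixel per embedding dimension, with each value min-max normalized and mapped onto a blue-to-red ramp. The names `value_to_rgb` and `encode_text_as_image`, the color ramp, and the toy embedding values are all hypothetical; the paper's actual color coding may differ.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]  # (red, green, blue), each in 0..255


def value_to_rgb(value: float, lo: float, hi: float) -> Pixel:
    """Map a scalar onto a blue-to-red ramp (assumed color coding, not the paper's)."""
    # Min-max normalize into [0, 1]; a constant input maps to the midpoint.
    t = 0.5 if hi == lo else (value - lo) / (hi - lo)
    t = min(1.0, max(0.0, t))
    red = round(255 * t)
    return (red, 0, 255 - red)


def encode_text_as_image(embeddings: List[List[float]]) -> List[List[Pixel]]:
    """Return an image with one row per word and one pixel per embedding dimension."""
    flat = [v for row in embeddings for v in row]
    lo, hi = min(flat), max(flat)
    return [[value_to_rgb(v, lo, hi) for v in row] for row in embeddings]


# Toy input: 3 words with embedding length 15 (values are made up for illustration).
toy_embeddings = [[(w + 1) * (d - 7) / 10.0 for d in range(15)] for w in range(3)]
text_image = encode_text_as_image(toy_embeddings)
print(len(text_image), len(text_image[0]))  # rows (words) and columns (dimensions)
```

Because the encoded text is an ordinary grid of pixels, it can be passed to the same visual network as a photograph, which is what removes the need for a separate text branch.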

Paper Structure (23 sections, 2 equations, 4 figures, 4 tables)

Figures (4)

  • Figure 1: The overall architecture of Chameleon. (a) Multimodal input consists of images and text. (b) Word embeddings are encoded as an image. An example encoding of three words ('Extol', 'craft', and 'scissors') with embedding length 15 is shown. (c) The joint variant of Chameleon, in which visual and encoded text inputs are fed sequentially to a weight-sharing visual network. (d) The fused variant of Chameleon, in which a single image created by fusing the visual and encoded text inputs is fed to the network.
  • Figure 2: Grad-CAM visualizations of selected images from UPMC Food-101: (a) Original image randomly selected from the test set. (b) Unimodal image-only training and testing. (c) Multimodal training and testing. (d) Multimodal training; testing on image only, i.e., 100% text missing. (e) Multimodal training; testing on text only, i.e., 100% image missing. With unimodal image-only training (b), the model focuses on distinct features of the object. With our multimodal training (c), the model not only retains its focus on the object but also uses the encodings representing the text modality to make its predictions. When the text modality (d) or image modality (e) is missing during testing, the model focuses on the available modality to make accurate final predictions, demonstrating that Chameleon trains a multimodal model that is robust to missing modalities.
  • Figure 3: Performance comparisons of the two variants of Chameleon (fused and joint) with ViLT (Kim et al., 2021) and Ma et al. (2022) at various levels of missing textual modality during testing on the UPMC Food-101 and Hateful Memes datasets. The smaller performance drop of our approach in most cases indicates its effectiveness in training Vision Transformers that are resilient to missing modalities without dataset-centric fusion strategies.
  • Figure 4: t-SNE visualizations of the embedding space of Chameleon (a-c) and ViLT (d-f), along with accuracy on the test set of UPMC Food-101. Compared to ViLT, Chameleon not only enhances the classification boundaries when complete modalities are available at test time but also retains these boundaries when the textual or visual modality is completely missing at test time. Note that classes are selected randomly from the test set.
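
The "levels of missing textual modality during testing" compared in Figure 3 can be simulated with a sketch like the one below. It is an illustrative assumption about the evaluation protocol, not code from the paper: a fixed fraction of test samples is chosen at random and their text is dropped. The function name `drop_text`, the sample layout, and the seeding are hypothetical.

```python
import random
from typing import List, Optional, Tuple

# (image identifier, text or None when the text modality is missing)
Sample = Tuple[str, Optional[str]]


def drop_text(samples: List[Tuple[str, str]], missing_rate: float, seed: int = 0) -> List[Sample]:
    """Drop the text of a random `missing_rate` fraction of samples (assumed protocol)."""
    if not 0.0 <= missing_rate <= 1.0:
        raise ValueError("missing_rate must lie in [0, 1]")
    rng = random.Random(seed)  # seeded so the same samples are dropped on every run
    n_missing = round(missing_rate * len(samples))
    missing = set(rng.sample(range(len(samples)), n_missing))
    return [(img, None if i in missing else txt) for i, (img, txt) in enumerate(samples)]


# Toy test set of 10 image-text pairs (placeholder identifiers and captions).
test_set = [(f"img_{i}.jpg", f"caption {i}") for i in range(10)]
for rate in (0.0, 0.3, 1.0):
    degraded = drop_text(test_set, rate)
    print(rate, sum(txt is None for _, txt in degraded))
```

With a single visual network, a sample whose text is `None` would simply be evaluated on its image alone, whereas a multi-branch model would have to substitute something for the empty text stream.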