Deep Multimodal Learning with Missing Modality: A Survey

Renjie Wu, Hu Wang, Hsiang-Ting Chen, Gustavo Carneiro

TL;DR

This survey addresses the challenge of missing modalities in deep multimodal learning, known as Multimodal Learning with Missing Modality (MLMM), by proposing a two-axis taxonomy that separates data processing (modality imputation vs. representation-focused models) from strategy design (architecture-focused models vs. model combinations). It provides a fine-grained categorization into twelve methodological families, analyzes 315 papers across diverse domains, and discusses applications, datasets, and open issues. Key contributions include a comprehensive taxonomy, an in-depth comparison of recovery and non-recovery approaches, and guidance on benchmarking, efficiency, and future directions. The work offers a structured roadmap for researchers to design robust MLMM systems and evaluate progress consistently across applications and modalities.

Abstract

During multimodal model training and testing, certain data modalities may be absent due to sensor limitations, cost constraints, privacy concerns, or data loss, which degrades performance. Multimodal learning techniques designed to handle missing modalities can mitigate this by keeping models robust even when some modalities are unavailable. This survey reviews recent progress in Multimodal Learning with Missing Modality (MLMM), focusing on deep learning methods. It is the first comprehensive survey to cover the motivation for MLMM and its distinctions from standard multimodal learning setups, followed by a detailed analysis of current methods, applications, and datasets, and it concludes with challenges and future directions.

Paper Structure (42 sections, 17 figures, 3 tables)

This paper contains 42 sections, 17 figures, 3 tables.

Figures (17)

  • Figure 1: (a) The trend of papers published on deep multimodal learning with missing modality in the past 10 years. The number of publications has increased over time, and the topic has received widespread attention from the community. (b) Description of a three-modality scenario with full- and missing-modality samples. We abbreviate "Modality" as "Mod" in all figures of this paper and use dashed boxes with fading colors to represent missing modalities/modules.
  • Figure 2: Our taxonomy of deep multimodal learning with missing modality methods. We categorize existing methods along two aspects: data processing and strategy design. Data Processing: we differentiate between modality imputation (handling missing modalities at the modality data level) and representation-focused models (handling them at the data representation level). Strategy Design: we distinguish between architecture-focused models (model architecture adjustments) and model combinations (combining multiple models externally). "MLLMs": multimodal large language models.
  • Figure 3: Zero/Random values composition methods. If we assume modality 2 is missing, then this modality will be replaced with zero/random values. "DNN" in all figures of this survey means different kinds of deep neural networks.
  • Figure 4: Retrieval-based modality composition methods. These methods search same-category samples that have the required missing modalities, either by random selection or with simple retrieval algorithms such as KNN or its variants, to find one or more samples. The retrieved modalities are then composed with the input missing-modality sample to form a "full"-modality sample.
  • Figure 5: Description of two typical modality generation methods. We set modality 2 as the missing modality as an example and use the other available modalities to generate it. "GEN" in both figures represents modality generation networks. (a) We set up a modality-2 generator (GEN-2) that takes the other modalities as input. (b) All modalities are input and generated together by a single GEN.
  • ...and 12 more figures
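
The two modality composition strategies described in the captions of Figures 3 and 4 can be illustrated with a minimal sketch. It is not taken from the survey: the function names (`fill_missing`, `retrieve_and_compose`), the use of `None` to mark a missing modality, the squared Euclidean distance over the observed modalities, and the averaging of the k retrieved neighbours are all assumptions made for illustration. The `bank` argument is assumed to hold only full-modality samples from the same category as the query.

```python
import random
from typing import List, Optional, Sequence

Vector = List[float]
# One entry per modality; None marks a missing modality (illustrative convention).
Sample = List[Optional[Vector]]


def fill_missing(sample: Sample, dims: Sequence[int], mode: str = "zero",
                 rng: Optional[random.Random] = None) -> List[Vector]:
    """Zero/random composition: replace each missing modality with placeholder values."""
    rng = rng or random.Random(0)
    filled = []
    for feat, dim in zip(sample, dims):
        if feat is not None:
            filled.append(list(feat))
        elif mode == "zero":
            filled.append([0.0] * dim)
        elif mode == "random":
            filled.append([rng.gauss(0.0, 1.0) for _ in range(dim)])
        else:
            raise ValueError(f"unknown mode: {mode}")
    return filled


def _sq_dist(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two equally sized vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def retrieve_and_compose(sample: Sample, bank: Sequence[List[Vector]],
                         k: int = 1) -> List[Vector]:
    """Retrieval-based composition: borrow missing modalities from the k nearest
    full-modality samples in `bank` (assumed to share the query's category)."""
    if not bank:
        raise ValueError("bank must contain at least one full-modality sample")
    observed = [i for i, feat in enumerate(sample) if feat is not None]
    # Rank candidates by distance computed on the observed modalities only.
    ranked = sorted(
        bank,
        key=lambda cand: sum(_sq_dist(sample[i], cand[i]) for i in observed),
    )
    neighbours = ranked[:k]
    composed = []
    for i, feat in enumerate(sample):
        if feat is not None:
            composed.append(list(feat))
        else:
            dim = len(neighbours[0][i])
            # Averaging the retrieved modalities is an assumption of this sketch.
            composed.append(
                [sum(n[i][d] for n in neighbours) / len(neighbours) for d in range(dim)]
            )
    return composed
```

For example, with `sample = [[1.0, 2.0], None, [3.0]]`, calling `fill_missing(sample, dims=[2, 3, 1])` substitutes a three-element zero vector for modality 2, whereas `retrieve_and_compose(sample, bank, k=1)` copies modality 2 from the bank entry whose modalities 1 and 3 are closest to the query. The composed "full"-modality sample can then be passed to a model that expects every modality to be present.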