Multimodal Fusion on Low-quality Data: A Comprehensive Survey

Qingyang Zhang, Yake Wei, Zongbo Han, Huazhu Fu, Xi Peng, Cheng Deng, Qinghua Hu, Cai Xu, Jie Wen, Di Hu, Changqing Zhang

TL;DR

This paper surveys the common challenges and recent advances of multimodal fusion in the wild and presents them in a comprehensive taxonomy to enable researchers to understand the state of the field and identify several potential directions.

Abstract

Multimodal fusion integrates information from multiple modalities to make more accurate predictions, and it has achieved remarkable progress in a wide range of scenarios, including autonomous driving and medical diagnosis. However, the reliability of multimodal fusion remains largely unexplored, especially under low-quality data settings. This paper surveys the common challenges and recent advances of multimodal fusion in the wild and presents them in a comprehensive taxonomy. From a data-centric view, we identify four main challenges faced by multimodal fusion on low-quality data: (1) noisy multimodal data, which are contaminated with heterogeneous noise; (2) incomplete multimodal data, in which some modalities are missing; (3) imbalanced multimodal data, in which the qualities or properties of different modalities differ significantly; and (4) quality-varying multimodal data, in which the quality of each modality changes dynamically from sample to sample. This new taxonomy will enable researchers to understand the state of the field and identify several potential directions. We also discuss open problems in this field together with interesting future research directions.

Paper Structure (33 sections, 12 equations, 5 figures, 4 tables)

This paper contains 33 sections, 12 equations, 5 figures, 4 tables.

Figures (5)

  • Figure 1: Illustrations of challenges for machine learning on low-quality multimodal data. Blue and gold represent different modalities, and a deeper color denotes higher quality. We assume $N$ multimodal samples consisting of $M$ different modalities, each of dimension $D$. $q(x)$ denotes the quality of the multimodal input, i.e., the information collected from $x$ that can support the downstream tasks. (a) The quality of noisy multimodal data is randomly influenced by unexpected environmental factors. (b) Certain modalities of incomplete multimodal data are of zero quality (they do not convey any useful information). (c) The expected quality differs across modalities for imbalanced multimodal data. (d) The quality of each modality varies across samples.
  • Figure 2: Imputation-based incomplete multimodal learning.
  • Figure 3: The learning curves (error rate) of the audio model (A), the video model (V), and the naive joint audio-video (AV) model on the Kinetics dataset. Solid lines plot validation error, while dashed lines show training error. Figure is from wang2020makes.
  • Figure 4: Performance of the uni-modal models, the jointly trained multimodal model, and the multimodal model with OGM-GE (peng2022balanced) on the validation set of the VGGSound dataset. Figures are from peng2022balanced.
  • Figure 5: Illustrations of dynamic fusion.
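
The four data settings described in the Figure 1 caption can be made concrete with a small synthetic sketch. The code below is not taken from the paper: the function name `simulate_low_quality`, the Gaussian noise model, and every parameter value are assumptions chosen only for illustration. It takes an array of shape $(N, M, D)$ and produces one toy corruption per setting.

```python
import numpy as np


def simulate_low_quality(x, rng, noise_std=0.5, missing_rate=0.3, weak_modality=1):
    """Toy corruptions of a float array x with shape (N, M, D).

    Hypothetical helper: the noise model and all defaults are illustrative
    assumptions, not the survey's definitions.
    """
    n, m, d = x.shape

    # (a) Noisy: additive Gaussian noise with a different scale per modality,
    # standing in for heterogeneous noise.
    modality_std = noise_std * (1.0 + np.arange(m))
    noisy = x + rng.normal(size=x.shape) * modality_std[None, :, None]

    # (b) Incomplete: a boolean mask of shape (N, M), True = modality observed.
    # Samples that lost every modality get one random modality restored.
    mask = rng.random((n, m)) >= missing_rate
    rows = np.flatnonzero(~mask.any(axis=1))
    mask[rows, rng.integers(0, m, size=rows.size)] = True
    incomplete = np.where(mask[:, :, None], x, 0.0)  # missing entries zero-filled

    # (c) Imbalanced: one modality is degraded for every sample, so its
    # expected quality is lower than that of the others.
    imbalanced = x.copy()
    imbalanced[:, weak_modality] += rng.normal(size=(n, d)) * 4.0 * noise_std

    # (d) Quality-varying: the noise level is drawn per sample and per modality,
    # so which modality is more reliable changes from sample to sample.
    sample_std = rng.uniform(0.0, 2.0 * noise_std, size=(n, m))
    quality_varying = x + rng.normal(size=x.shape) * sample_std[:, :, None]

    return {
        "noisy": noisy,
        "incomplete": incomplete,
        "mask": mask,
        "imbalanced": imbalanced,
        "quality_varying": quality_varying,
        "sample_std": sample_std,
    }


# Usage: N=6 samples, M=2 modalities, D=3 features per modality.
rng = np.random.default_rng(0)
clean = rng.normal(size=(6, 2, 3))
variants = simulate_low_quality(clean, rng)
print({name: arr.shape for name, arr in variants.items()})
```

In this sketch, settings (a) and (c) apply a fixed corruption level to each modality, whereas (d) draws a new level for every sample and modality, which is the property that motivates sample-wise (dynamic) fusion weights.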