CyIN: Cyclic Informative Latent Space for Bridging Complete and Incomplete Multimodal Learning

Ronghao Lin, Qiaolin He, Sijie Mai, Ying Zeng, Aolin Xiong, Li Huang, Yap-Peng Tan, Haifeng Hu

TL;DR

CyIN tackles the challenge of robust multimodal learning under unpredictable missing modalities by constructing an informative latent space guided by token- and label-level Information Bottlenecks and enabling cross-modal cyclic translation between modalities. This unified framework supports both complete and incomplete multimodal learning within a single model, using a Cascaded Residual Autoencoder translator and a Multimodal Transformer for fusion. The method demonstrates strong performance and robustness across four datasets and diverse missing-data scenarios, supported by ablations that highlight the importance of the bottleneck design and cyclic translation. The work offers a principled approach to robust multimodal fusion with potential for broad real-world impact, while acknowledging limitations and avenues for enhancement such as modality imbalance handling and alternative translators.

Abstract

Multimodal machine learning, which mimics the human brain's ability to integrate various modalities, has seen rapid growth. Most previous multimodal models are trained on perfectly paired multimodal input to reach optimal performance. In real-world deployments, however, the availability of each modality is highly variable and unpredictable, so pre-trained models suffer significant performance drops and fail to remain robust when modalities go missing dynamically. In this paper, we present a novel Cyclic INformative Learning framework (CyIN) to bridge the gap between complete and incomplete multimodal learning. Specifically, we first build an informative latent space by applying token- and label-level Information Bottleneck (IB) cyclically among the modalities. By capturing task-related features with variational approximation, the informative bottleneck latents are purified for more efficient cross-modal interaction and multimodal fusion. Moreover, to supplement the information lost through incomplete multimodal input, we propose cross-modal cyclic translation, which reconstructs the missing modalities from the remaining ones through forward and reverse propagation processes. With the help of the extracted and reconstructed informative latents, CyIN jointly optimizes complete and incomplete multimodal learning in one unified model. Extensive experiments on 4 multimodal datasets demonstrate the superior performance of our method in both complete and diverse incomplete scenarios.
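
To make the "Information Bottleneck with variational approximation" idea in the abstract concrete, the sketch below shows the generic variational IB objective for a label-level bottleneck: a task loss plus a weighted KL term that compresses the latent toward a standard normal prior. It is a minimal illustration under stated assumptions, not the CyIN implementation. The function names, the diagonal-Gaussian posterior, the standard normal prior, and the weight `beta` are all assumptions introduced here, and the paper's token-level term and cyclic translation are not modeled.

```python
import numpy as np

def kl_to_standard_normal(mu, log_var):
    """KL(N(mu, diag(exp(log_var))) || N(0, I)), summed over the latent dimension."""
    return 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var, axis=-1)

def reparameterize(mu, log_var, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, I) (the reparameterization trick)."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def cross_entropy(logits, labels):
    """Per-sample cross-entropy computed with a numerically stable log-softmax."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels]

def vib_loss(mu, log_var, logits, labels, beta=1e-3):
    """Task loss plus beta-weighted compression term, averaged over the batch.

    beta is a hypothetical trade-off weight chosen for illustration only.
    """
    return float(np.mean(cross_entropy(logits, labels)
                         + beta * kl_to_standard_normal(mu, log_var)))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    mu = rng.standard_normal((8, 16))          # posterior means for 8 samples
    log_var = 0.1 * rng.standard_normal((8, 16))  # posterior log-variances
    z = reparameterize(mu, log_var, rng)       # sampled bottleneck latents
    logits = z[:, :3]                          # stand-in for a classifier head on z
    labels = rng.integers(0, 3, size=8)
    print(vib_loss(mu, log_var, logits, labels))
```

A larger `beta` pushes the latent harder toward the prior, discarding more input detail; a smaller one keeps more of it for the task loss.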

Paper Structure (29 sections, 57 equations, 2 figures, 14 tables)

This paper contains 29 sections, 57 equations, 2 figures, 14 tables.

Figures (2)

  • Figure 1: Framework overview. The proposed CyIN builds a cyclic informative space to jointly train complete and incomplete multimodal learning.
  • Figure 2: (a) Feature distribution of translated unimodal latents and multimodal latents and (b) Examples on the test set of MOSI and IEMOCAP datasets when inferring with and without reconstructed information from CyIN. The ✓ and ✘ in the modal status denote the remaining and missing modalities, respectively.