CyIN: Cyclic Informative Latent Space for Bridging Complete and Incomplete Multimodal Learning
Ronghao Lin, Qiaolin He, Sijie Mai, Ying Zeng, Aolin Xiong, Li Huang, Yap-Peng Tan, Haifeng Hu
TL;DR
CyIN tackles the challenge of robust multimodal learning under unpredictable missing modalities by constructing an informative latent space guided by token- and label-level Information Bottlenecks and by performing cross-modal cyclic translation. This unified framework supports both complete and incomplete multimodal learning within a single model, using a Cascaded Residual Autoencoder as the translator and a Multimodal Transformer for fusion. The method demonstrates strong performance and robustness across four datasets and diverse missing-data scenarios, supported by ablations that highlight the importance of the bottleneck design and cyclic translation. The work offers a principled approach to robust multimodal fusion, while acknowledging limitations and avenues for enhancement such as modality imbalance handling and alternative translators.
Abstract
Multimodal machine learning, which mimics the human brain's ability to integrate various modalities, has seen rapid growth. Most previous multimodal models are trained on perfectly paired multimodal input to reach optimal performance. In real-world deployments, however, the presence of each modality is highly variable and unpredictable, causing pre-trained models to suffer significant performance drops and to lose robustness when modalities are dynamically missing. In this paper, we present a novel Cyclic INformative Learning framework (CyIN) to bridge the gap between complete and incomplete multimodal learning. Specifically, we first build an informative latent space by applying token- and label-level Information Bottlenecks (IB) cyclically across modalities. By capturing task-related features with variational approximation, the informative bottleneck latents are purified for more efficient cross-modal interaction and multimodal fusion. Moreover, to supplement the information lost to incomplete multimodal input, we propose cross-modal cyclic translation, which reconstructs the missing modalities from the remaining ones through forward and reverse propagation processes. With the help of the extracted and reconstructed informative latents, CyIN jointly optimizes complete and incomplete multimodal learning in one unified model. Extensive experiments on four multimodal datasets demonstrate the superior performance of our method in both complete and diverse incomplete scenarios.
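
For readers unfamiliar with the Information Bottleneck, the generic objective and its usual variational relaxation can be written as below. This is the textbook form, given only as background; it is not necessarily the exact loss used in CyIN. The symbols x (input), y (prediction target), z (bottleneck latent), the encoder p_θ, the decoder q_φ, the prior r(z), and the trade-off weight β are notation assumed here, not taken from the abstract. Under this reading, a label-level bottleneck would take y to be the task label, and a token-level bottleneck would take y to be the token features of another modality; that mapping is likewise an assumption.

```latex
% Generic Information Bottleneck: keep what z says about y, compress what z retains of x
\max_{\theta}\; I(Z;Y) \;-\; \beta\, I(Z;X)

% Standard variational upper bound on the negated objective, minimized in practice
\mathcal{L}_{\mathrm{VIB}}
  = \mathbb{E}_{p(x,y)}\,\mathbb{E}_{p_\theta(z \mid x)}\!\left[-\log q_\phi(y \mid z)\right]
  \;+\; \beta\,\mathbb{E}_{p(x)}\!\left[\mathrm{KL}\!\left(p_\theta(z \mid x)\,\big\|\,r(z)\right)\right]
```

The first term rewards latents that remain predictive of the target, and the KL term penalizes information about the input that the target does not need, which matches the "purified" latents described in the abstract.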
