Uncertainty-Resilient Multimodal Learning via Consistency-Guided Cross-Modal Transfer
Hyo-Jeong Jang
TL;DR
This work addresses uncertainty in real-world multimodal learning for brain–computer interfaces and human–computer interaction by proposing consistency-guided cross-modal transfer. It combines uncertainty-aware cross-modal knowledge distillation, a prototype-based similarity module, a Dirichlet-based uncertainty estimate, and a joint cross-modal active learning framework that uses EEG and facial cues to guide sample annotation. Empirical results on MAHNOB-HCI show improved stability, discriminative power, and robustness to noisy or incomplete supervision for both discrete and continuous emotion tasks. Qualitative latent-space analyses indicate that cross-modal structure is preserved under challenging conditions. The approach offers a practical, scalable path toward reliable, adaptive multimodal systems for BCI/HCI applications.
Abstract
Multimodal learning systems often face substantial uncertainty due to noisy data, low-quality labels, and heterogeneous modality characteristics. These issues become especially critical in human–computer interaction settings, where data quality, semantic reliability, and annotation consistency vary across users and recording conditions. This thesis tackles these challenges by exploring uncertainty-resilient multimodal learning through consistency-guided cross-modal transfer. The central idea is to use cross-modal semantic consistency as a basis for robust representation learning. By projecting heterogeneous modalities into a shared latent space, the proposed framework mitigates modality gaps and uncovers structural relations that support uncertainty estimation and stable feature learning. Building on this foundation, the thesis investigates strategies to enhance semantic robustness, improve data efficiency, and reduce the impact of noise and imperfect supervision without relying on large sets of high-quality annotations. Experiments on multimodal affect-recognition benchmarks demonstrate that consistency-guided cross-modal transfer significantly improves model stability, discriminative ability, and robustness to noisy or incomplete supervision. Latent-space analyses further show that the framework captures reliable cross-modal structure even under challenging conditions. Overall, this thesis offers a unified perspective on resilient multimodal learning by integrating uncertainty modeling, semantic alignment, and data-efficient supervision, providing practical insights for developing reliable and adaptive brain–computer interface systems.
