Uncertainty-Resilient Multimodal Learning via Consistency-Guided Cross-Modal Transfer

Hyo-Jeong Jang

TL;DR

This work addresses uncertainty in real-world multimodal learning for brain–computer interfaces (BCI) and human–computer interaction (HCI) by proposing consistency-guided cross-modal transfer. It has two components: uncertainty-aware cross-modal knowledge distillation, which pairs a prototype-based similarity module with a Dirichlet-based uncertainty estimate, and a joint cross-modal active learning framework that uses EEG and facial cues to guide sample annotation. Empirical results on MAHNOB-HCI show improved stability, discriminative power, and robustness to noisy or incomplete supervision for both discrete and continuous emotion tasks. Qualitative latent-space analyses indicate that cross-modal structure is preserved under challenging conditions. The approach offers a practical, scalable path toward reliable, adaptive multimodal systems for BCI/HCI applications.

Abstract

Multimodal learning systems often face substantial uncertainty due to noisy data, low-quality labels, and heterogeneous modality characteristics. These issues become especially critical in human–computer interaction settings, where data quality, semantic reliability, and annotation consistency vary across users and recording conditions. This thesis tackles these challenges by exploring uncertainty-resilient multimodal learning through consistency-guided cross-modal transfer. The central idea is to use cross-modal semantic consistency as a basis for robust representation learning. By projecting heterogeneous modalities into a shared latent space, the proposed framework mitigates modality gaps and uncovers structural relations that support uncertainty estimation and stable feature learning. Building on this foundation, the thesis investigates strategies to enhance semantic robustness, improve data efficiency, and reduce the impact of noise and imperfect supervision without relying on large volumes of high-quality annotations. Experiments on multimodal affect-recognition benchmarks demonstrate that consistency-guided cross-modal transfer significantly improves model stability, discriminative ability, and robustness to noisy or incomplete supervision. Latent-space analyses further show that the framework captures reliable cross-modal structure even under challenging conditions. Overall, this thesis offers a unified perspective on resilient multimodal learning by integrating uncertainty modeling, semantic alignment, and data-efficient supervision, providing practical insights for developing reliable and adaptive brain–computer interface systems.

Paper Structure

This paper contains 27 sections, 25 equations, 3 figures, and 5 tables.

Figures (3)

  • Figure 1: Overview of the proposed cross-modal active learning (AL) framework for enhancing emotional intelligence in multimodal human–agent interaction. (a) Problem: An agent with limited affective understanding misinterprets the user’s emotional states from multimodal signals, resulting in inaccurate emotion recognition and poor adaptability. (b) Proposed method: The framework enables the agent to improve its emotional understanding via uncertainty-based active learning that selectively queries the user for feedback and updates its model.
  • Figure 2: Visualization of the learned feature distribution. For comparison, we present the features learned by the unimodal state-of-the-art model MASA-TCN (top), the multimodal state-of-the-art model CAFNet (middle), and our proposed cross-modal knowledge distillation (KD) method (bottom) for both the arousal and valence classification tasks.
  • Figure 3: Evolution of uncertainty during cross-modal active learning. (a) Comparison of uncertainty distributions between the first and last iterations. (b) Trend of the mean of the top-5% uncertainty values across active learning steps.
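
The quantity tracked in Figure 3(b), the mean of the top-5% uncertainty values, can be illustrated with a minimal sketch. The summary does not give the exact formulation, so the code below assumes the common evidential-learning convention in which each sample's Dirichlet parameters are alpha_k = evidence_k + 1 and the uncertainty is u = K / sum(alpha). The function names `dirichlet_uncertainty` and `top_fraction_mean` are hypothetical and are not taken from the paper.

```python
from typing import Sequence


def dirichlet_uncertainty(evidence: Sequence[float]) -> float:
    """Return u = K / S for one sample.

    Assumption (not confirmed by the source): alpha_k = evidence_k + 1,
    and S is the sum of all alpha_k over the K classes.
    """
    if len(evidence) == 0:
        raise ValueError("evidence must contain at least one class")
    if any(e < 0 for e in evidence):
        raise ValueError("evidence must be non-negative")
    num_classes = len(evidence)
    strength = sum(e + 1.0 for e in evidence)  # Dirichlet strength S
    return num_classes / strength


def top_fraction_mean(values: Sequence[float], fraction: float = 0.05) -> float:
    """Mean of the largest `fraction` of values (always at least one value)."""
    if len(values) == 0:
        raise ValueError("values must be non-empty")
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    k = max(1, int(round(fraction * len(values))))
    largest = sorted(values, reverse=True)[:k]
    return sum(largest) / k


# Usage: per-sample evidence for a two-class task (made-up numbers).
pool = [[8.0, 0.0], [0.0, 0.0], [3.0, 1.0], [0.5, 0.5]]
uncertainties = [dirichlet_uncertainty(e) for e in pool]
summary = top_fraction_mean(uncertainties, fraction=0.5)
```

Under this assumption, a sample with no evidence has uncertainty 1, and uncertainty shrinks as evidence accumulates, so a falling top-fraction mean across active learning steps would indicate that the most uncertain samples are becoming better understood.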