RoHyDR: Robust Hybrid Diffusion Recovery for Incomplete Multimodal Emotion Recognition

Yuehan Jin, Xiaoqing Liu, Yiyuan Yang, Zhiwen Yu, Tong Zhang, Kaixiang Yang

TL;DR

Incomplete Multimodal Emotion Recognition (IMER) addresses emotion recognition when modalities are missing or corrupted. RoHyDR takes a dual-path approach: a diffusion-based unimodal recovery generator conditioned on the available modalities, and an adversarial multimodal fusion module guided by a discriminator to reconstruct robust fused representations. A three-stage optimization stabilizes training by separately optimizing the unimodal recovery, fusion, and classification objectives. Experiments on MOSI and MOSEI show state-of-the-art or competitive performance under various missing-modality scenarios, with improved robustness and training stability relative to prior IMER methods.

Abstract

Multimodal emotion recognition analyzes emotions by combining data from multiple sources. However, real-world noise or sensor failures often cause missing or corrupted data, creating the Incomplete Multimodal Emotion Recognition (IMER) challenge. In this paper, we propose Robust Hybrid Diffusion Recovery (RoHyDR), a novel framework that performs missing-modality recovery at unimodal, multimodal, feature, and semantic levels. For unimodal representation recovery of missing modalities, RoHyDR exploits a diffusion-based generator to generate distribution-consistent and semantically aligned representations from Gaussian noise, using available modalities as conditioning. For multimodal fusion recovery, we introduce adversarial learning to produce a realistic fused multimodal representation and recover missing semantic content. We further propose a multi-stage optimization strategy that enhances training stability and efficiency. In contrast to previous work, the hybrid diffusion and adversarial learning-based recovery mechanism in RoHyDR allows recovery of missing information in both unimodal representation and multimodal fusion, at both feature and semantic levels, effectively mitigating performance degradation caused by suboptimal optimization. Comprehensive experiments conducted on two widely used multimodal emotion recognition benchmarks demonstrate that our proposed method outperforms state-of-the-art IMER methods, achieving robust recognition performance under various missing-modality scenarios. Our code will be made publicly available upon acceptance.
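
To make the idea of generating a representation "from Gaussian noise" more concrete, the following is a minimal sketch of the forward noising step and noise-prediction loss of a standard DDPM-style diffusion model. It is not RoHyDR's implementation: the linear schedule, the function names, and the `placeholder_denoiser` (which only marks where conditioning on the available modalities would enter) are all assumptions made for illustration.

```python
import math
import random


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced noise variances (a common DDPM default, assumed here)."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alphas(betas):
    """alpha_bar_t = product over s <= t of (1 - beta_s)."""
    out, prod = [], 1.0
    for beta in betas:
        prod *= 1.0 - beta
        out.append(prod)
    return out


def forward_noise(x0, t, alpha_bars, rng):
    """Sample x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps."""
    ab = alpha_bars[t]
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = [math.sqrt(ab) * x + math.sqrt(1.0 - ab) * e for x, e in zip(x0, eps)]
    return xt, eps


def placeholder_denoiser(xt, t, condition):
    """Hypothetical stand-in for a learned network eps_theta(x_t, t, condition).

    `condition` represents features of the available modalities; this
    placeholder ignores its inputs and predicts zero noise.
    """
    return [0.0 for _ in xt]


def denoising_loss(eps_true, eps_pred):
    """Mean squared error between the sampled and the predicted noise."""
    return sum((a - b) ** 2 for a, b in zip(eps_true, eps_pred)) / len(eps_true)
```

In this sketch, training would draw a timestep `t`, noise the ground-truth representation of the missing modality with `forward_noise`, and minimize `denoising_loss` against the denoiser's prediction. At test time, sampling would start from pure Gaussian noise and use only the conditioning features.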

Paper Structure

This paper contains 16 sections, 12 equations, 4 figures, 5 tables, 1 algorithm.

Figures (4)

  • Figure 1: (a) Previous generative model-based methods typically employ a single-level autoencoder-based generator to recover each missing unimodal input. (b) Our proposed Robust Hybrid Diffusion Recovery (RoHyDR) integrates a diffusion-based generator for unimodal recovery and adversarial learning components for multimodal recovery. Additionally, the multi-stage optimization strategy effectively mitigates the issue of suboptimal convergence. (c) RoHyDR exhibits more stable training dynamics across 50 training epochs.
  • Figure 2: The framework of the proposed RoHyDR. In the context of IMER, multimodal data $[A; T; V]$ is divided into available modalities and missing modalities. Any single modality or any pair of modalities may be missing (the vision modality is shown as missing here as an example). During training, RoHyDR receives the ground truth of the missing modalities, the available modalities, and Gaussian noise as inputs. During testing, only the available modalities and Gaussian noise are provided as inputs.
  • Figure 3: Visualization of training stability and efficiency. Panels (a–d) compare RoHyDR with state-of-the-art methods. Panels (e–h) show the ablation study of the optimization strategy.
  • Figure 4: Hyper-parameter analysis. (a) and (d) analyze the effect of $\lambda_{g}$, (b) and (e) analyze $\lambda_{al}$, (c) and (f) examine the impact of $\lambda_{c}$.
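
The missing-modality setting described in the Figure 2 caption (any single modality or any pair may be missing) can be enumerated with a short sketch. The helper names and the string labels are hypothetical, and reading $A$, $T$, $V$ as audio, text, and vision is an assumption based on the caption's mention of the vision modality.

```python
from itertools import combinations

# Assumed reading of [A; T; V]: audio, text, vision.
MODALITIES = ("A", "T", "V")


def missing_scenarios(modalities=MODALITIES):
    """List (available, missing) pairs with at least one modality available."""
    scenarios = []
    for k in range(1, len(modalities)):
        for missing in combinations(modalities, k):
            available = tuple(m for m in modalities if m not in missing)
            scenarios.append((available, missing))
    return scenarios


def model_inputs(available, missing, training):
    """Name the inputs described in the caption for training vs. testing."""
    inputs = ["available:" + m for m in available] + ["gaussian_noise"]
    if training:
        # The ground truth of the missing modalities is only given in training.
        inputs += ["ground_truth:" + m for m in missing]
    return inputs


if __name__ == "__main__":
    for available, missing in missing_scenarios():
        print(available, missing, model_inputs(available, missing, training=False))
```

With three modalities this gives six scenarios: three with one modality missing and three with two missing.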