Federated Dialogue-Semantic Diffusion for Emotion Recognition under Incomplete Modalities

Xihang Qiu, Jiarong Cheng, Yuhao Fang, Wanpeng Zhang, Yao Lu, Ye Zhang, Chun Li

TL;DR

FedDISC tackles robust Multimodal Emotion Recognition in Conversations (MERC) under incomplete modalities by training modality-specific diffusion models locally on each client and aggregating them across clients, so that clients can recover the modalities they lack. Its DISC-Diffusion module uses a Dialogue Graph Network (DGN) and a Semantic Conditioning Network (SCN) to supply conversational context and semantic guidance for diffusion-based recovery, while an Alternating Frozen Aggregation strategy coordinates optimization of the recovery and classifier modules in a privacy-preserving federated learning (FL) setting. Experiments on IEMOCAP, CMU-MOSI, and CMU-MOSEI show strong performance under both fixed and random missing-modality protocols, often outperforming state-of-the-art recovery methods and remaining robust as missing rates rise. The approach enables privacy-preserving cross-client collaboration, reduces communication costs, and offers practical scalability for real-world MERC deployments with incomplete data.

Abstract

Multimodal Emotion Recognition in Conversations (MERC) enhances emotional understanding through the fusion of multimodal signals. However, unpredictable modality absence in real-world scenarios significantly degrades the performance of existing methods. Conventional missing-modality recovery approaches, which depend on training with complete multimodal data, often suffer from semantic distortion under extreme data distributions, such as fixed-modality absence. To address this, we propose the Federated Dialogue-guided and Semantic-Consistent Diffusion (FedDISC) framework, pioneering the integration of federated learning into missing-modality recovery. By aggregating modality-specific diffusion models trained on clients and broadcasting them to clients that lack the corresponding modalities, FedDISC overcomes single-client reliance on modality completeness. Additionally, the DISC-Diffusion module ensures consistency in context, speaker identity, and semantics between recovered and available modalities, using a Dialogue Graph Network to capture conversational dependencies and a Semantic Conditioning Network to enforce semantic alignment. We further introduce a novel Alternating Frozen Aggregation strategy, which cyclically freezes the recovery and classifier modules to facilitate collaborative optimization. Extensive experiments on the IEMOCAP, CMU-MOSI, and CMU-MOSEI datasets demonstrate that FedDISC achieves superior emotion classification performance across diverse missing-modality patterns, outperforming existing approaches.
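The aggregate-then-broadcast idea described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names `aggregate_by_modality` and `broadcast_missing` are hypothetical, model weights are stood in for by flat lists of floats, and the sample-count weighting (FedAvg-style) is an assumption, since the abstract only states that modality-specific models are aggregated.

```python
from typing import Dict, List

Params = List[float]  # stand-in for a flattened diffusion-model weight vector


def aggregate_by_modality(
    client_models: Dict[str, Dict[str, Params]],
    client_sizes: Dict[str, int],
) -> Dict[str, Params]:
    """Average each modality's model over the clients that actually hold it.

    Assumption: clients are weighted by local sample count (FedAvg-style).
    """
    sums: Dict[str, Params] = {}
    totals: Dict[str, int] = {}
    for cid, models in client_models.items():
        n = client_sizes[cid]
        for modality, params in models.items():
            if modality not in sums:
                sums[modality] = [0.0] * len(params)
                totals[modality] = 0
            sums[modality] = [s + n * p for s, p in zip(sums[modality], params)]
            totals[modality] += n
    return {m: [s / totals[m] for s in vec] for m, vec in sums.items()}


def broadcast_missing(
    global_models: Dict[str, Params],
    client_models: Dict[str, Dict[str, Params]],
) -> Dict[str, Dict[str, Params]]:
    """For each client, return the global models of the modalities it lacks."""
    return {
        cid: {m: list(p) for m, p in global_models.items() if m not in models}
        for cid, models in client_models.items()
    }
```

In this sketch, a client that holds only text would receive the aggregated audio model trained by other clients, which it could then use to recover audio features locally without any raw data leaving a client.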

Paper Structure

This paper contains 27 sections, 3 theorems, 32 equations, 9 figures, 6 tables, 2 algorithms.

Key Result

Theorem 1

Run FedDISC for $T$ global rounds with step-size $\eta \le 1/M$, $E$ local steps, $K$ clients chosen i.i.d., then

Figures (9)

  • Figure 1: The framework of DGN and SCN. DGN captures context and speaker dependencies through a graph network, while SCN captures cross-modal semantic information with an attention mechanism.
  • Figure 2: The training pipeline of FedDISC. The figure delineates a hierarchical federated learning framework with two alternating phases: the Recovery Module Training Stage (right) and the Classifier Optimization Stage (left).
  • Figure 3: t-SNE visualization comparing the modality recovery performance of different methods under single-modality availability. The features generated by FedDISC exhibit higher distributional similarity to the original modality features than those of other methods, demonstrating its effectiveness.
  • Figure 4: The visualized ablation study on the IEMOCAP6 dataset. Compared with unconditional modality recovery, DGN and SCN guide recovery by leveraging dialogue and semantic alignment, ensuring category consistency between recovered and original modalities.
  • Figure 5: (a) illustrates two types of incomplete modalities: the random missing protocol and the fixed missing protocol. (b) presents the federated generative modality recovery paradigm proposed in this work, designed to alleviate the dependency of generative recovery on data completeness.
  • ...and 4 more figures
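
The two missing-modality protocols named in the Figure 5 caption can be illustrated with availability masks. This is a sketch under stated assumptions: the caption does not define the protocols precisely, so here "fixed" is taken to mean that every utterance lacks the same modalities, and "random" to mean that each modality of each utterance is dropped independently with probability `missing_rate`, keeping at least one modality. The names `fixed_missing_mask` and `random_missing_mask` and the three-modality set are hypothetical.

```python
import random
from typing import Dict, Iterable, List

MODALITIES = ("text", "audio", "visual")  # assumed modality set


def fixed_missing_mask(n: int, available: Iterable[str]) -> List[Dict[str, bool]]:
    """Fixed protocol (assumed): every utterance has the same available modalities."""
    keep = set(available)
    return [{m: (m in keep) for m in MODALITIES} for _ in range(n)]


def random_missing_mask(
    n: int, missing_rate: float, seed: int = 0
) -> List[Dict[str, bool]]:
    """Random protocol (assumed): drop each modality independently per utterance."""
    rng = random.Random(seed)
    masks = []
    for _ in range(n):
        mask = {m: rng.random() >= missing_rate for m in MODALITIES}
        if not any(mask.values()):
            # Keep at least one modality so the utterance is not empty.
            mask[rng.choice(MODALITIES)] = True
        masks.append(mask)
    return masks
```

Under the fixed protocol a client never observes the missing modality at all, which is the case where a locally trained recovery model has nothing to learn from and a model aggregated from other clients becomes useful.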

Theorems & Definitions (3)

  • Theorem 1: Convergence of Recovery-Only Rounds
  • Theorem 2: Error Bound for Recovered Latent Vectors
  • Theorem 3: Linear Rate of Alternating‐Freeze Optimisation
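
The alternating-freeze idea referenced in Theorem 3 and in the two phases of Figure 2 can be sketched as a simple round schedule. This is only an illustration: the function name `alternating_freeze_schedule` and the parameters `recovery_rounds` and `classifier_rounds` are hypothetical, and the page does not state how many rounds each phase lasts, so a strict cyclic alternation is assumed.

```python
from typing import Dict, List, Union


def alternating_freeze_schedule(
    num_rounds: int, recovery_rounds: int = 1, classifier_rounds: int = 1
) -> List[Dict[str, Union[int, str]]]:
    """Assign each global round to one trainable module; the other stays frozen.

    Assumption: phases repeat cyclically, recovery first, then classifier.
    """
    cycle = recovery_rounds + classifier_rounds
    schedule = []
    for t in range(num_rounds):
        train = "recovery" if (t % cycle) < recovery_rounds else "classifier"
        frozen = "classifier" if train == "recovery" else "recovery"
        schedule.append({"round": t, "train": train, "frozen": frozen})
    return schedule
```

In a real training loop, the `frozen` entry would typically translate into disabling gradient updates for that module's parameters during the round, so that only the `train` module's weights are updated and aggregated.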