
Decoupled Hierarchical Distillation for Multimodal Emotion Recognition

Yong Li, Yuanzhi Wang, Yi Ding, Shiqing Zhang, Ke Lu, Cuntai Guan

TL;DR

This work tackles multimodal emotion recognition by addressing cross-modal heterogeneity with a Decoupled Hierarchical Multimodal Distillation (DHMD) framework. DHMD decouples each modality into modality-irrelevant (homogeneous) and modality-exclusive (heterogeneous) spaces using a self-regression-based training regime, then applies a two-stage knowledge distillation: (i) coarse-grained KD via Graph Distillation Units that learn dynamic inter-modality transfer patterns, and (ii) fine-grained KD via cross-modal Dictionary Matching that aligns semantic granularities across modalities in a shared dictionary space. The method yields consistent improvements over state-of-the-art MER approaches on CMU-MOSI, CMU-MOSEI, UR-FUNNY, and MUStARD, with ablations and visualizations validating the effectiveness of both KD stages and the interpretability of learned graph edges and dictionary activations. The results suggest that decoupling and hierarchical KD enable more robust, discriminative, and interpretable multimodal representations for emotion recognition, with potential for extension to foundation-model-based pipelines.

Abstract

Human multimodal emotion recognition (MER) seeks to infer human emotions by integrating information from language, visual, and acoustic modalities. Although existing MER approaches have achieved promising results, they still struggle with inherent multimodal heterogeneities and varying contributions from different modalities. To address these challenges, we propose a novel framework, Decoupled Hierarchical Multimodal Distillation (DHMD). DHMD decouples each modality's features into modality-irrelevant (homogeneous) and modality-exclusive (heterogeneous) components using a self-regression mechanism. The framework employs a two-stage knowledge distillation (KD) strategy: (1) coarse-grained KD via a Graph Distillation Unit (GD-Unit) in each decoupled feature space, where a dynamic graph facilitates adaptive distillation among modalities, and (2) fine-grained KD through a cross-modal dictionary matching mechanism, which aligns semantic granularities across modalities to produce more discriminative MER representations. This hierarchical distillation approach enables flexible knowledge transfer and effectively improves cross-modal feature alignment. Experimental results demonstrate that DHMD consistently outperforms state-of-the-art MER methods, achieving relative improvements of 1.3\%/2.4\% (ACC$_7$), 1.3\%/1.9\% (ACC$_2$), and 1.9\%/1.8\% (F1) on the CMU-MOSI/CMU-MOSEI datasets, respectively. Meanwhile, visualization results reveal that both the graph edges and dictionary activations in DHMD exhibit meaningful distribution patterns across the modality-irrelevant and modality-exclusive feature spaces.
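To make the coarse-grained stage more concrete, the following is a minimal sketch of how a graph distillation unit could weight knowledge transfer among the three modalities. It is an illustration under stated assumptions, not the authors' implementation: the function name `graph_distillation_loss`, the softmax normalization of incoming edges, and the L1 distance between per-modality logits are all hypothetical choices, and in a trained model the raw edge scores would be learned rather than fixed by hand.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def l1_distance(a, b):
    """Sum of absolute differences between two equal-length vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))

def graph_distillation_loss(logits, edge_scores):
    """Weighted distillation loss over a directed modality graph.

    logits:      dict modality -> list of floats (hypothetical per-modality predictions)
    edge_scores: dict (source, target) -> raw edge score (assumed learnable elsewhere)
    Returns the total loss and the normalized edge weights.
    """
    loss = 0.0
    weights = {}
    for target in logits:
        sources = [m for m in logits if m != target]
        # Assumption: incoming edges of each target are normalized with a softmax.
        normalized = softmax([edge_scores[(src, target)] for src in sources])
        for src, weight in zip(sources, normalized):
            weights[(src, target)] = weight
            # Assumption: the per-edge discrepancy is an L1 distance between logits.
            loss += weight * l1_distance(logits[src], logits[target])
    return loss, weights

# Toy example with language (L), visual (V), and acoustic (A) logits.
logits = {"L": [2.0, 0.0], "V": [1.0, 1.0], "A": [0.0, 2.0]}
edge_scores = {(s, t): 0.0 for s in logits for t in logits if s != t}
loss, weights = graph_distillation_loss(logits, edge_scores)
```

With all raw edge scores equal, every target receives its two sources with equal weight; raising one score shifts more of the distillation signal onto that edge, which is the sense in which the graph is adaptive.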

Paper Structure (16 sections, 14 equations, 13 figures, 8 tables)

Figures (13)

  • Figure 1: (a) illustrates the significant discrepancies in emotion recognition when using a single modality, adapted from MulT. (b) shows conventional cross-modal distillation. (c) shows our proposed DHMD method. DHMD implements a two-stage knowledge distillation (KD) approach comprising coarse- and fine-grained KD. Coarse-grained KD utilizes homogeneous and heterogeneous graph distillation units (GD) to reduce the complexity of cross-modal KD, enhancing specificity and efficiency. Fine-grained KD incorporates a shared dictionary as a unified semantic space for cross-modal alignment.
  • Figure 2: The framework of DHMD. Given the input multimodal data, DHMD encodes their respective shallow features $\tilde{\mathbf{X}}_{m}$, where $m \in \{L, V, A\}$. In feature decoupling, DHMD exploits the decoupled homo-/heterogeneous multimodal features $\mathbf{X}^{\text{com}}_{m}$ / $\mathbf{X}^{\text{prt}}_{m}$ via the shared and exclusive encoders, respectively. $\mathbf{X}^{\text{prt}}_{m}$ will be reconstructed in a self-regression manner (Sec. \ref{sec:decoupling}). For coarse-grained KD, $\mathbf{X}^{\text{com}}_{m}$ and $\mathbf{X}^{\text{prt}}_{m}$ will be fed into a GD-Unit for adaptive KD in HoGD and HeGD, respectively (Sec. \ref{sec:distillation}). For fine-grained KD, DHMD utilizes distinct dictionaries within each enhanced feature space to unify semantic granularities across modalities and achieve semantic alignment (Sec. \ref{sec:cross-modal_CM}). Finally, the features from the two-stage KD mechanisms are adaptively fused for MER.
  • Figure 3: Framework of the cross-modal Dictionary Matching (DM) mechanism for fine-grained KD. Decoupled multimodal features are projected onto a shared dictionary and subsequently reconstructed as weighted combinations of the dictionary elements. Overlapping elements within this dictionary capture the shared semantics across the modalities, facilitating fine-grained alignment.
  • Figure 4: Comparison of the decoupled homogeneous and heterogeneous features on the CMU-MOSEI dataset. w/ FD denotes the baseline method, where we merely obtain the decoupled features. w/ FD, GD adds a GD-Unit on each decoupled feature space on top of the w/ FD method. w/ FD, GD, DM adds the cross-modal dictionary matching (DM) mechanism in each feature space on top of the w/ FD, GD method. Evidently, both the coarse-grained KD mechanism, enhanced by graph distillation, and the fine-grained KD mechanism, based on dictionary matching, contribute consistently to cross-modal alignment.
  • Figure 5: Comparison of the decoupled homogeneous and heterogeneous features on the MUStARD dataset. The coarse-/fine-grained KD mechanisms in our proposed DHMD collaboratively enhance the discriminability of each input modality. These mechanisms not only strengthen cross-modal alignment but also lead to improvements in the overall MER performance.
  • ...and 8 more figures
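
The dictionary matching step described in the Figure 3 caption (project a feature onto a shared dictionary, then reconstruct it as a weighted combination of dictionary elements) can be sketched as follows. This is a minimal illustration under assumptions, not the paper's implementation: the dot-product similarity, the softmax activations, the `temperature` parameter, and the name `match_to_dictionary` are all hypothetical, and the toy dictionary is fixed by hand rather than learned.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def match_to_dictionary(feature, dictionary, temperature=1.0):
    """Project a feature onto dictionary elements and reconstruct it.

    Assumption: activations are a softmax over dot-product similarities.
    Returns (activations, reconstruction), where the reconstruction is the
    activation-weighted sum of the dictionary elements.
    """
    scores = [dot(feature, atom) / temperature for atom in dictionary]
    activations = softmax(scores)
    reconstruction = [
        sum(w * atom[i] for w, atom in zip(activations, dictionary))
        for i in range(len(feature))
    ]
    return activations, reconstruction

# Toy shared dictionary with three elements in a 2-D feature space.
dictionary = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]

# Hypothetical language and visual features that carry similar semantics.
lang_act, lang_rec = match_to_dictionary([4.0, 0.0], dictionary)
vis_act, vis_rec = match_to_dictionary([3.0, 0.5], dictionary)
```

In this toy setup both features activate the same dictionary element most strongly, which mirrors the caption's point that overlapping elements capture semantics shared across modalities.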