Decoupled Hierarchical Distillation for Multimodal Emotion Recognition
Yong Li, Yuanzhi Wang, Yi Ding, Shiqing Zhang, Ke Lu, Cuntai Guan
TL;DR
This work tackles multimodal emotion recognition (MER) by addressing cross-modal heterogeneity with a Decoupled Hierarchical Multimodal Distillation (DHMD) framework. DHMD decouples each modality into modality-irrelevant (homogeneous) and modality-exclusive (heterogeneous) spaces using a self-regression-based training regime, then applies a two-stage knowledge distillation (KD): (i) coarse-grained KD via Graph Distillation Units that learn dynamic inter-modality transfer patterns, and (ii) fine-grained KD via cross-modal dictionary matching that aligns semantic granularities across modalities in a shared dictionary space. The method yields consistent improvements over state-of-the-art MER approaches on CMU-MOSI, CMU-MOSEI, UR-FUNNY, and MUStARD, with ablations and visualizations supporting the effectiveness of both KD stages and the interpretability of the learned graph edges and dictionary activations. The results suggest that decoupling and hierarchical KD enable more robust, discriminative, and interpretable multimodal representations for emotion recognition, with potential for extension to foundation-model-based pipelines.
Abstract
Human multimodal emotion recognition (MER) seeks to infer human emotions by integrating information from language, visual, and acoustic modalities. Although existing MER approaches have achieved promising results, they still struggle with inherent multimodal heterogeneities and with the varying contributions of different modalities. To address these challenges, we propose a novel framework, Decoupled Hierarchical Multimodal Distillation (DHMD). DHMD decouples each modality's features into modality-irrelevant (homogeneous) and modality-exclusive (heterogeneous) components using a self-regression mechanism. The framework employs a two-stage knowledge distillation (KD) strategy: (1) coarse-grained KD via a Graph Distillation Unit (GD-Unit) in each decoupled feature space, where a dynamic graph facilitates adaptive distillation among modalities, and (2) fine-grained KD through a cross-modal dictionary matching mechanism, which aligns semantic granularities across modalities to produce more discriminative MER representations. This hierarchical distillation approach enables flexible knowledge transfer and effectively improves cross-modal feature alignment. Experimental results demonstrate that DHMD consistently outperforms state-of-the-art MER methods, achieving relative improvements of 1.3\%/2.4\% (ACC$_7$), 1.3\%/1.9\% (ACC$_2$), and 1.9\%/1.8\% (F1) on the CMU-MOSI/CMU-MOSEI datasets, respectively. Meanwhile, visualization results reveal that both the graph edges and the dictionary activations in DHMD exhibit meaningful distribution patterns across the modality-irrelevant and modality-exclusive feature spaces.
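To make the coarse-grained stage more concrete, the following is a minimal sketch of what a graph-weighted distillation loss could look like. It is not taken from the paper: the abstract does not give the GD-Unit's formulation, so the softmax normalization of edge scores, the mean absolute difference between logits, and the names `softmax`, `graph_distillation_loss`, and `edge_scores` are all assumptions made for illustration. In a real GD-Unit the edge scores would presumably be learned; here they are fixed numbers.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def graph_distillation_loss(logits, edge_scores):
    """Hypothetical graph-weighted distillation loss (illustrative only).

    logits:      dict mapping modality name -> list of float logits
    edge_scores: dict mapping (source, target) -> raw edge score
                 (assumed to come from a learned scorer; fixed here)
    Returns the normalized edge weights and the weighted loss.
    """
    edges = list(edge_scores)
    # Assumption: edge strengths are normalized with a softmax over all edges.
    weights = softmax([edge_scores[e] for e in edges])

    loss = 0.0
    for (src, dst), weight in zip(edges, weights):
        # Assumption: mean absolute difference between source and target logits.
        gap = sum(abs(a - b) for a, b in zip(logits[src], logits[dst]))
        loss += weight * gap / len(logits[src])
    return dict(zip(edges, weights)), loss


# Toy example: language (L) distills into visual (V) and acoustic (A).
example_logits = {"L": [1.0, 0.0], "V": [0.0, 0.0], "A": [1.0, 0.0]}
example_scores = {("L", "V"): 0.0, ("L", "A"): 0.0}
example_weights, example_loss = graph_distillation_loss(example_logits, example_scores)
```

With equal edge scores, both edges receive weight 0.5, so only the L-to-V gap contributes to the loss. Raising the score of one edge shifts weight toward it, which is the sense in which a dynamic graph could adapt the direction and strength of distillation among modalities.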
