PCQ: Emotion Recognition in Speech via Progressive Channel Querying

Xincheng Wang, Liejun Wang, Yinfeng Yu, Xinxin Jiao

TL;DR

The paper tackles the challenge of capturing long-term temporal correlations in speech emotion recognition (SER) by introducing Progressive Channel Querying (PCQ), a framework that progressively queries channel-wise emotion information across network layers. PCQ combines a spectrogram-based Multilayer Lightweight CNN (MLCNN) and a pre-trained WavLM encoder, connected by a Channel Semantic Query (CSQ) module that aggregates semantically similar emotion features across layers to model dynamic emotional trajectories. Relative to the baseline, the method improves WA by 3.98% and 3.45% and UA by 5.67% and 5.83% on IEMOCAP and EMODB, respectively, with a reduced parameter footprint, and ablation confirms that the CSQ and WavLM components are essential to the gains. These results highlight the practical potential of progressive, channel-focused reasoning for SER and lay groundwork for future multimodal extensions in multi-scene contexts.

Abstract

In human-computer interaction (HCI), Speech Emotion Recognition (SER) is a key technology for understanding human intentions and emotions. Traditional SER methods struggle to effectively capture the long-term temporal correlations and dynamic variations in complex emotional expressions. To overcome these limitations, we introduce the PCQ method, a pioneering approach for SER via Progressive Channel Querying. This method can drill down layer by layer in the channel dimension through the channel query technique to achieve dynamic modeling of long-term contextual information of emotions. This multi-level analysis gives the PCQ method an edge in capturing the nuances of human emotions. Experimental results show that our model improves the weighted average (WA) accuracy by 3.98% and 3.45% and the unweighted average (UA) accuracy by 5.67% and 5.83% on the IEMOCAP and EMODB emotion recognition datasets, respectively, significantly exceeding the baseline levels.
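
The abstract reports WA and UA without giving their formulas. The sketch below assumes the convention common in SER work: WA is overall accuracy (each class implicitly weighted by its sample count), and UA is the mean of per-class recalls (every class counted equally). The function names and the toy labels are illustrative, not taken from the paper.

```python
from collections import defaultdict


def weighted_accuracy(y_true, y_pred):
    """WA (assumed convention): fraction of all utterances classified correctly."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def unweighted_accuracy(y_true, y_pred):
    """UA (assumed convention): per-class recall averaged over the classes."""
    totals = defaultdict(int)  # utterances per true class
    hits = defaultdict(int)    # correctly classified utterances per true class
    for t, p in zip(y_true, y_pred):
        totals[t] += 1
        hits[t] += int(t == p)
    return sum(hits[c] / totals[c] for c in totals) / len(totals)


# Toy, imbalanced example (labels are made up for illustration).
y_true = ["neu"] * 6 + ["ang"] * 2 + ["sad"] * 2
y_pred = ["neu"] * 6 + ["ang", "neu"] + ["neu", "neu"]
wa = weighted_accuracy(y_true, y_pred)
ua = unweighted_accuracy(y_true, y_pred)
```

On imbalanced data the two metrics can diverge sharply, because a model that favors the majority class keeps WA high while UA drops. This is why SER papers usually report both.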

Paper Structure (12 sections, 6 equations, 4 figures, 4 tables)

Figures (4)

  • Figure 1: (A) The overall PCQ framework. (B) The MLCNN network. (C) The PDC module.
  • Figure 2: CSQ module. The letters on the arrows denote the following operations. C: a convolution is applied first so that the number of channels $C_{h}$ of the high-level features matches the number of channels $C_{l}$ of the low-level features. B: bilinear interpolation adjusts the height H and width W so that the high-level feature map ($H_{h}$, $W_{h}$) and the low-level feature map ($H_{l}$, $W_{l}$) have the same spatial size. S: the features are split along the channel dimension. d is the dilation rate of the convolution.
  • Figure 3: Visualisation of MLCNN results for different layers on the IEMOCAP dataset: blue for weighted accuracy (WA), green for unweighted accuracy (UA).
  • Figure 4: (Top) The t-SNE visualization of feature distribution on the IEMOCAP dataset. (Bottom) Comparison of normalized confusion matrices on the IEMOCAP dataset.
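
The three alignment operations named in the Figure 2 caption (C, B, S) can be illustrated with a small dependency-free sketch. It is not the authors' implementation. It assumes that C is a bias-free 1x1 convolution, that B resizes the high-level map to the low-level size with corner-aligned sampling, and that S splits the channels into equal groups. All function names and the `weights` and `groups` parameters are hypothetical.

```python
# Feature maps are nested lists shaped [channel][row][col].

def match_channels(high, weights):
    """C: 1x1 convolution, i.e. a per-pixel linear mix of input channels.
    weights[o][i] maps input channel i to output channel o (no bias; assumed)."""
    rows, cols = len(high[0]), len(high[0][0])
    return [
        [[sum(w * high[i][r][c] for i, w in enumerate(w_out))
          for c in range(cols)]
         for r in range(rows)]
        for w_out in weights
    ]


def bilinear_resize(channel, out_h, out_w):
    """B: bilinear interpolation of one 2-D map (corner-aligned; assumed)."""
    in_h, in_w = len(channel), len(channel[0])
    out = []
    for r in range(out_h):
        y = r * (in_h - 1) / (out_h - 1) if out_h > 1 else 0.0
        y0 = int(y)
        y1 = min(y0 + 1, in_h - 1)
        fy = y - y0
        row = []
        for c in range(out_w):
            x = c * (in_w - 1) / (out_w - 1) if out_w > 1 else 0.0
            x0 = int(x)
            x1 = min(x0 + 1, in_w - 1)
            fx = x - x0
            top = channel[y0][x0] * (1 - fx) + channel[y0][x1] * fx
            bottom = channel[y1][x0] * (1 - fx) + channel[y1][x1] * fx
            row.append(top * (1 - fy) + bottom * fy)
        out.append(row)
    return out


def split_channels(features, groups):
    """S: split the channel dimension into equal-sized groups (assumed)."""
    size = len(features) // groups
    return [features[g * size:(g + 1) * size] for g in range(groups)]


def align_high_to_low(high, weights, low_h, low_w, groups):
    """Apply C, then B, then S to a high-level feature map."""
    mixed = match_channels(high, weights)
    resized = [bilinear_resize(ch, low_h, low_w) for ch in mixed]
    return split_channels(resized, groups)
```

In a deep-learning framework these steps would typically be a learned convolution, a built-in interpolation call, and a tensor split. The plain-list version only makes the shape bookkeeping explicit: $C_{h}$ channels become $C_{l}$, ($H_{h}$, $W_{h}$) becomes ($H_{l}$, $W_{l}$), and the result is partitioned along the channel axis.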