PCQ: Emotion Recognition in Speech via Progressive Channel Querying
Xincheng Wang, Liejun Wang, Yinfeng Yu, Xinxin Jiao
TL;DR
The paper tackles the challenge of capturing long-term temporal correlations in speech emotion recognition (SER) by introducing Progressive Channel Querying (PCQ), a framework that progressively queries channel-wise emotion information across network layers. PCQ combines a spectrogram-based Multilayer Lightweight CNN (MLCNN) and a pre-trained WavLM encoder, connected by a Channel Semantic Query (CSQ) module that aggregates semantically similar emotion features across layers to model dynamic emotional trajectories. Relative to the baseline, the method improves WA by 3.98% and 3.45% and UA by 5.67% and 5.83% on IEMOCAP and EMODB, respectively, with a reduced parameter footprint, and ablation confirms that the CSQ and WavLM components are essential to the gains. These results highlight the practical potential of progressive, channel-focused reasoning for SER and lay groundwork for future multimodal extensions in multi-scene contexts.
Abstract
In human-computer interaction (HCI), Speech Emotion Recognition (SER) is a key technology for understanding human intentions and emotions. Traditional SER methods struggle to effectively capture the long-term temporal correlations and dynamic variations in complex emotional expressions. To overcome these limitations, we introduce the PCQ method, a pioneering approach for SER via \textbf{P}rogressive \textbf{C}hannel \textbf{Q}uerying. This method can drill down layer by layer in the channel dimension through the channel query technique to achieve dynamic modeling of long-term contextual information of emotions. This multi-level analysis gives the PCQ method an edge in capturing the nuances of human emotions. Experimental results show that our model improves the weighted average (WA) accuracy by 3.98\% and 3.45\% and the unweighted average (UA) accuracy by 5.67\% and 5.83\% on the IEMOCAP and EMODB emotion recognition datasets, respectively, significantly exceeding the baseline levels.
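
The abstract reports gains in both WA and UA, and the two metrics can diverge when emotion classes are imbalanced. The sketch below assumes the convention common in the SER literature, where WA is overall accuracy across all utterances and UA is the mean of per-class recalls; the abstract does not give the exact formulas, so this reading is an assumption. The function names and the toy labels are hypothetical and serve only as illustration.

```python
from collections import defaultdict


def weighted_accuracy(y_true, y_pred):
    """Overall accuracy: correct predictions divided by all samples (assumed WA)."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def unweighted_accuracy(y_true, y_pred):
    """Mean of per-class recalls, so every class counts equally (assumed UA)."""
    total = defaultdict(int)  # samples per true class
    hit = defaultdict(int)    # correctly predicted samples per true class
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        hit[t] += int(t == p)
    return sum(hit[c] / total[c] for c in total) / len(total)


# Toy, imbalanced example (illustrative labels, not data from the paper):
# six "neu" utterances, all correct; two "sad" utterances, one correct.
y_true = ["neu"] * 6 + ["sad"] * 2
y_pred = ["neu"] * 6 + ["neu", "sad"]

wa = weighted_accuracy(y_true, y_pred)    # 7 / 8 = 0.875
ua = unweighted_accuracy(y_true, y_pred)  # (1.0 + 0.5) / 2 = 0.75
print(f"WA={wa:.3f} UA={ua:.3f}")
```

In this toy case the majority class inflates WA, while UA exposes the weaker recall on the minority class, which is why results on imbalanced corpora are usually read with both numbers together.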
