B-GRPO: Unsupervised Speech Emotion Recognition based on Batched-Group Relative Policy Optimization
Yingying Gao, Shilei Zhang, Runyan Yang, Zihao Cui, Junlan Feng
TL;DR
This work reframes unsupervised speech emotion recognition as a long-term reinforcement learning problem in which the decision to include a sample is the action taken by the policy. It introduces Batched-GRPO, a batch-based adaptation of Group Relative Policy Optimization that uses the batch-average reward as the baseline and combines self-reward and teacher-reward signals to encourage high-confidence predictions, with a modified advantage function that excludes negative values. Empirical results across five datasets show substantial improvements over baselines and prior unsupervised methods, with self-reward driving most of the gains and Whisper-based features yielding strong performance. The approach demonstrates that RL-driven sample selection can reduce reliance on labeled data in SER and can leverage in-domain or external data through selective sampling.
Abstract
Unsupervised speech emotion recognition (SER) addresses the data sparsity and annotation bias of emotional speech. Reinforcement learning (RL) is a promising approach because it improves performance through rule-based or model-based verification functions rather than human annotations. We treat sample selection during learning as a long-term process and the decision of whether to select a sample as the action of the policy, which allows RL to be used to measure sample quality in SER. We propose a modified Group Relative Policy Optimization (GRPO) adapted to classification problems: it takes the samples in a batch as a group and uses their average reward as the baseline for calculating the advantage. Rather than using a verifiable reward function as in GRPO, we put forward self-reward and teacher-reward functions that encourage the model to produce high-confidence outputs. Experiments indicate that the proposed method improves on the baseline without RL by 19.8%.
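
To make the batch-as-group idea concrete, the following is a minimal sketch in Python, not the authors' implementation. It assumes the self-reward is the maximum softmax probability of a sample (one plausible reading of "high-confidence outputs") and that the advantage is the reward minus the batch-average reward, with negative values set to zero as the TL;DR describes. The function names `self_reward` and `batched_advantages`, the toy logits, and the omission of any teacher-reward or variance normalization are all assumptions made for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of class logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def self_reward(logits):
    """Hypothetical self-reward: the model's own top-class probability."""
    return max(softmax(logits))


def batched_advantages(rewards):
    """Advantage relative to the batch-average reward, negatives clipped to 0.

    The batch plays the role of the GRPO group; the clipping is an assumed
    form of the 'excludes negative values' modification.
    """
    baseline = sum(rewards) / len(rewards)
    return [max(0.0, r - baseline) for r in rewards]


# Toy batch of three utterances with four emotion classes (made-up logits).
batch_logits = [
    [4.0, 0.0, 0.0, 0.0],  # confident prediction
    [1.0, 0.9, 1.1, 1.0],  # near-uniform, low confidence
    [0.0, 3.0, 0.0, 0.5],  # fairly confident prediction
]

rewards = [self_reward(logits) for logits in batch_logits]
advantages = batched_advantages(rewards)

# Samples with a positive advantage are the ones that would drive the update.
selected = [i for i, a in enumerate(advantages) if a > 0]
print(selected)
```

In this toy batch the low-confidence second sample falls below the batch average, so its advantage is clipped to zero and it contributes nothing to the update, which is the sense in which the policy "selects" samples.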
