EMO-Codec: An In-Depth Look at Emotion Preservation Capacity of Legacy and Neural Codec Models with Subjective and Objective Evaluations
Wenze Ren, Yi-Cheng Lin, Huang-Cheng Chou, Haibin Wu, Yi-Chiao Wu, Chi-Chun Lee, Hung-yi Lee, Yu Tsao
TL;DR
This study examines how well speech codecs preserve emotional information, through a multilingual evaluation of 14 neural codecs and 3 legacy codecs at varied bitrates. It combines objective speech emotion recognition (SER) performance, measured with SSL-based representations, and human subjective tests on IEMOCAP and EMO-SUPERB-derived datasets to benchmark emotion retention. The findings show that neural codecs, particularly DAC and SpeechTokenizer, generally outperform legacy codecs at similar bitrates. However, resynthesis can significantly degrade recognition of challenging emotions such as fear and sadness, and bilingual training yields limited gains for Chinese emotion preservation. The work provides benchmarks and practical guidance for designing codecs that maintain affective cues, with implications for speech LMs and real-world applications that require emotion-aware communication.
Abstract
Neural codec models reduce speech data transmission delay and serve as the foundational tokenizers for speech language models (speech LMs). Preserving emotional information in codecs is crucial for effective communication and context understanding, yet emotion loss in existing codecs has received little study. This paper evaluates neural and legacy codecs using subjective and objective methods on emotion datasets such as IEMOCAP. Our study identifies which codecs best preserve emotional information under various bitrate scenarios. We found that training codec models with both English and Chinese data had limited success in retaining emotional information in Chinese. Additionally, resynthesizing speech through these codecs degrades the performance of speech emotion recognition (SER), particularly for emotions such as sadness, depression, fear, and disgust. Human listening tests confirmed these findings. This work guides future speech technology development so that new codecs maintain the integrity of emotional information in speech.
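The per-emotion degradation described above can be illustrated with a minimal sketch. It assumes an SER model has already produced predictions for the same utterances before and after codec resynthesis, and it compares per-emotion recall between the two. The function names (`per_emotion_recall`, `recall_drop`) and the toy labels are hypothetical and made up for illustration; they are not the paper's data, metric, or evaluation code.

```python
from collections import defaultdict


def per_emotion_recall(labels, predictions):
    """Return {emotion: recall} for every emotion that appears in labels."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for true, pred in zip(labels, predictions):
        totals[true] += 1
        if true == pred:
            hits[true] += 1
    return {emotion: hits[emotion] / totals[emotion] for emotion in totals}


def recall_drop(labels, preds_original, preds_resynth):
    """Per-emotion recall on original audio minus recall on resynthesized audio."""
    before = per_emotion_recall(labels, preds_original)
    after = per_emotion_recall(labels, preds_resynth)
    return {emotion: before[emotion] - after[emotion] for emotion in before}


# Toy, made-up labels and predictions (assumption: not results from the paper).
labels = ["happy", "happy", "sad", "sad", "fear", "fear"]
preds_original = ["happy", "happy", "sad", "sad", "fear", "sad"]
preds_resynth = ["happy", "happy", "sad", "happy", "sad", "sad"]

drops = recall_drop(labels, preds_original, preds_resynth)
for emotion, drop in drops.items():
    print(f"{emotion}: recall drop {drop:.2f}")
```

A positive value means the emotion is recognized less often after resynthesis. Reporting the drop per emotion, rather than a single overall accuracy, is what makes it possible to see whether particular emotions are affected more than others.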
