EMO-Codec: An In-Depth Look at Emotion Preservation capacity of Legacy and Neural Codec Models With Subjective and Objective Evaluations

Wenze Ren, Yi-Cheng Lin, Huang-Cheng Chou, Haibin Wu, Yi-Chiao Wu, Chi-Chun Lee, Hung-yi Lee, Yu Tsao

TL;DR

This study tackles the problem of preserving emotional information through speech codecs by conducting a comprehensive, multilingual evaluation of 14 neural codecs and 3 legacy codecs under varied bitrates. It integrates objective SER performance using SSL-based representations and human subjective tests on IEMOCAP and EMO-SUPERB-derived datasets to benchmark emotion retention. The findings show neural codecs, particularly DAC and SpeechTokenizer, generally outperform legacy codecs at similar bitrates, though resynthesis can significantly degrade challenging emotions like fear and sadness; bilingual training yields limited gains for Chinese emotion preservation. The work provides benchmarks and practical guidance for designing codecs that maintain affective cues, with implications for speech LMs and real-world applications requiring emotion-aware communication.

Abstract

The neural codec model reduces speech data transmission delay and serves as the foundational tokenizer for speech language models (speech LMs). Preserving emotional information in codecs is crucial for effective communication and context understanding. However, there is a lack of studies on emotion loss in existing codecs. This paper evaluates neural and legacy codecs using subjective and objective methods on emotion datasets like IEMOCAP. Our study identifies which codecs best preserve emotional information under various bitrate scenarios. We found that training codec models with both English and Chinese data had limited success in retaining emotional information in Chinese. Additionally, resynthesizing speech through these codecs degrades the performance of speech emotion recognition (SER), particularly for emotions like sadness, depression, fear, and disgust. Human listening tests confirmed these findings. This work guides future speech technology developments to ensure new codecs maintain the integrity of emotional information in speech.

Paper Structure (17 sections, 8 figures, 4 tables)

This paper contains 17 sections, 8 figures, 4 tables.

Figures (8)

  • Figure 1: The pipeline of Emo-Codec. The process starts by training the SSL model for emotion recognition using only the original audio. Then, we run inference on the test set for both the original audio and the audio resynthesized with codecs. We calculate the difference in $F_1$ score for emotion recognition between the two to obtain the objective emotion loss. We also use the original audio files and the codec-resynthesized audio files in the human subjective listening test for emotion recognition.
  • Figure 2: Legends for all codecs throughout this paper
  • Figure 3: Emotion recognition performance in macro-$F_1$ score on different codecs with the IEMOCAP dataset. The red dashed line represents the SER performance of the original audio.
  • Figure 4: Emotion recognition performance (macro-$F_1$) on different codecs with the (a) CREMA-D, (b) IMPROV, and (c) PODCAST datasets. The red dashed line represents the SER performance of the original audio.
  • Figure 5: Emotion recognition performance (macro-$F_1$) on different codecs with the Chinese datasets (a) NNIME and (b) BIIC-PODCAST. The red dashed line represents the SER performance of the original audio.
  • ...and 3 more figures
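
The objective measure described in the Figure 1 caption (the drop in macro-$F_1$ between original and resynthesized audio) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes single-label predictions already produced by some SER model, and the function names `macro_f1` and `emotion_loss`, the label strings, and the sign convention (original minus resynthesized) are all hypothetical choices made for the example.

```python
def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-class F1 over the given label set."""
    scores = []
    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        denom = 2 * tp + fp + fn
        # A class with no true or predicted samples contributes 0 (assumed convention).
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def emotion_loss(y_true, pred_original, pred_resynth):
    """Macro-F1 on original audio minus macro-F1 on codec-resynthesized audio.

    A positive value means the codec made emotions harder to recognize.
    """
    labels = sorted(set(y_true))  # same label set for both conditions
    return (macro_f1(y_true, pred_original, labels)
            - macro_f1(y_true, pred_resynth, labels))


if __name__ == "__main__":
    # Toy data (hypothetical): the SER model is perfect on original audio,
    # but confuses "sad" with "neu" after resynthesis.
    y_true = ["ang", "hap", "sad", "neu"]
    pred_original = ["ang", "hap", "sad", "neu"]
    pred_resynth = ["ang", "hap", "neu", "neu"]
    print(round(emotion_loss(y_true, pred_original, pred_resynth), 3))
```

Fixing the label set from the ground-truth labels keeps the two macro averages comparable, so the difference reflects only the change in predictions caused by the codec.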