Emotion-Aware Quantization for Discrete Speech Representations: An Analysis of Emotion Preservation

Haoguang Zhou; Siyi Wang; Jingyao Wu; James Bailey; Ting Dang

Abstract

Modern speech systems increasingly use discretized self-supervised speech representations for compression and integration with token-based models, yet their impact on emotional information remains unclear. We study how residual vector quantization (RVQ) reshapes emotional information in discrete speech representations from both representation- and task-level perspectives. Our analysis shows that aggressive compression disproportionately degrades emotion, with uneven loss across emotion classes and model architectures. To address this, we introduce emotion-aware quantization using emotion-specific and emotion-biased codebooks, improving the preservation of both hard and soft emotion perception. We further propose Emo-Q, a lightweight routed quantization method that selects emotion-specialized codebooks, improving emotion recognition performance at lower bitrates. These results highlight the importance of emotion-aware discretization for robust affective speech processing.
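As a rough illustration of the residual vector quantization (RVQ) idea referred to in the abstract, the sketch below quantizes a single feature vector stage by stage: each stage picks the nearest codeword to the current residual and passes the remainder on to the next stage. It is a minimal sketch under stated assumptions, not the authors' implementation. The codebooks are random rather than trained, and the names `rvq_encode` and `cosine_similarity` and the sizes `D`, `K`, and `depth` are hypothetical choices made only for this example.

```python
import numpy as np


def rvq_encode(x, codebooks):
    """Quantize one vector with a list of codebooks, each of shape (K, D).

    Returns the selected code index per stage and the reconstruction,
    which is the sum of the selected codewords.
    """
    residual = np.asarray(x, dtype=np.float64).copy()
    recon = np.zeros_like(residual)
    codes = []
    for cb in codebooks:
        # Squared Euclidean distance from the residual to every codeword.
        dists = np.sum((cb - residual) ** 2, axis=1)
        idx = int(np.argmin(dists))
        codes.append(idx)
        recon += cb[idx]
        # The next stage only sees what this stage failed to explain.
        residual -= cb[idx]
    return codes, recon


def cosine_similarity(a, b):
    """Cosine similarity with a small epsilon to avoid division by zero."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))


# Assumed toy setup: random (untrained) codebooks with shrinking scale.
rng = np.random.default_rng(0)
D, K, depth = 16, 32, 4
codebooks = [rng.normal(scale=0.5 ** i, size=(K, D)) for i in range(depth)]
x = rng.normal(size=D)

# Reconstruction similarity as more quantization stages are used.
for n in range(1, depth + 1):
    codes, recon = rvq_encode(x, codebooks[:n])
    print(n, codes, round(cosine_similarity(x, recon), 3))
```

Each stage contributes log2(K) bits per frame, so using fewer stages lowers the bitrate at the cost of a coarser reconstruction. In this toy configuration (K = 32, four stages) a frame costs at most 20 bits.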

Paper Structure (12 sections, 1 equation, 5 figures, 1 table)

Introduction
Related Work
Experimental Setup
Data and Quantization
Evaluations & Metrics
Results and Findings
RQ1: How is emotional information structured and degraded under RVQ across different SSL architectures?
RQ2: Can emotion-specific codebook training improve affective preservation?
RQ3: Does emotion-specific quantization preserve fine-grained affective structure under compression?
Downstream Utility: Emotion-aware SER with Routed Quantization
Conclusion
Generative AI Use Disclosure

Figures (5)

Figure 1: Quantization of discrete speech representation
Figure 2: Overview of the pipeline. (a) Balanced and emotion-specific codebooks. (b) Representation-level evaluation: layer-wise degradation analysis (RQ1), primary emotion retention (RQ2), and soft distribution fidelity (RQ3). (c) Task-level evaluation: downstream SER via routed quantization (RQ4).
Figure 3: Reconstruction fidelity (cosine similarity, top) and primary emotion recall (bottom) versus quantization depth under RVQ for three different SSL frontends.
Figure 4: RQ2: affective retention (left) and codebook utilization (right) for balanced and emotion-specific quantization.
Figure 5: RQ3 evaluation: emotion distribution matching (left) and top-2 emotion recall (right) for varying codebook training strategies.
