Segmentation-Variant Codebooks for Preservation of Paralinguistic and Prosodic Information

Nicholas Sanders, Yuanchao Li, Korin Richmond, Simon King

TL;DR

The paper addresses the challenge that discrete speech units (DSUs) from SSL models often lose paralinguistic and prosodic information. It introduces Segmentation-Variant Codebooks (SVCs), which quantize speech at multiple linguistic levels (frame, phone, word, utterance) to produce four parallel DSU streams that factorize information by segment. Experimental results show that pre-discretization pooling improves preservation of prosody and emotion, and that SVCs outperform fixed-frame baselines in probing tasks and expressivity in resynthesis, though continuous representations remain the strongest baseline. Overall, SVCs offer an efficient approach to richer DSU representations with better prosody preservation and provide a foundation for future improvements in pooling strategies, segmentation methods, and multi-layer integration.
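
The difference between pooling before and after discretization can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes nearest-centroid quantization against a k-means-style codebook, segments given as (start, end) frame indices, and one plausible reading of post-pooling (quantize each frame, average the chosen centroids, then re-quantize). All function names are hypothetical.

```python
import numpy as np

def quantize(x, codebook):
    """Return the index of the nearest codebook centroid for each row of x."""
    # Squared Euclidean distance between every row of x and every centroid.
    d = ((x[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return d.argmin(axis=1)

def pre_pool_units(frames, segments, codebook):
    """Pre-pooling: mean-pool continuous frames per segment, then quantize."""
    pooled = np.stack([frames[s:e].mean(axis=0) for s, e in segments])
    return quantize(pooled, codebook)

def post_pool_units(frames, segments, codebook):
    """Post-pooling (assumed variant): quantize frames, average centroids, re-quantize."""
    ids = quantize(frames, codebook)
    pooled = np.stack([codebook[ids[s:e]].mean(axis=0) for s, e in segments])
    return quantize(pooled, codebook)

# Toy 1-D example: four frames forming a single segment.
frames = np.array([[0.45], [0.45], [0.45], [1.9]])
codebook = np.array([[0.0], [1.0], [3.0]])
segments = [(0, 4)]
pre_ids = pre_pool_units(frames, segments, codebook)
post_ids = post_pool_units(frames, segments, codebook)
```

In this toy case the two orders select different units, because quantizing each frame first rounds away detail before the segment average is taken. It illustrates the mechanism only and says nothing about the size of the effect reported in the paper.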

Abstract

Quantization in SSL speech models (e.g., HuBERT) improves compression and performance in tasks like language modeling, resynthesis, and text-to-speech but often discards prosodic and paralinguistic information (e.g., emotion, prominence). While increasing codebook size mitigates some loss, it inefficiently raises bitrates. We propose Segmentation-Variant Codebooks (SVCs), which quantize speech at distinct linguistic units (frame, phone, word, utterance), factorizing it into multiple streams of segment-specific discrete features. Our results show that SVCs are significantly more effective at preserving prosodic and paralinguistic information across probing tasks. Additionally, we find that pooling before rather than after discretization better retains segment-level information. Resynthesis experiments further confirm improved style realization and slightly improved quality while preserving intelligibility.
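
The bitrate cost of a larger codebook can be made concrete with a little arithmetic. The sketch below assumes fixed-length coding (log2 of the codebook size in bits per unit) and a 50 Hz frame rate, as in HuBERT. The function names and the example codebook sizes are illustrative and not taken from the paper.

```python
import math

def stream_bitrate(units_per_second, codebook_size):
    """Bits per second for one DSU stream under fixed-length coding (an assumption)."""
    return units_per_second * math.log2(codebook_size)

def total_bitrate(streams):
    """Sum the bitrates of parallel streams given as (rate, codebook_size) pairs."""
    return sum(stream_bitrate(rate, size) for rate, size in streams)

# Frame-level stream at 50 units/s: each doubling of the codebook adds 50 bits/s.
small = stream_bitrate(50, 128)    # 50 * 7 bits per second
large = stream_bitrate(50, 2048)   # 50 * 11 bits per second
```

Under this model slower streams are cheap: a hypothetical stream emitting 10 units per second from 128 codes would add 70 bits/s. The segment rates in a real corpus, and the bitrates the paper reports, may differ.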

Paper Structure

This paper contains 21 sections, 1 equation, 1 figure, 4 tables.

Figures (1)

  • Figure 1: Overview of Segmentation-Variant Codebook processing with pre-pooling. $h_n$ denotes the continuous hidden representation at the $n$th frame; $S_n$ denotes the resulting frame-level stream, obtained by mean pooling the multiple DSU streams produced by the overlapping segmentations.
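
One way to read the caption is that each stream's units are mapped back to codebook vectors, repeated over the frames their segment spans, and averaged across streams to give $S_n$. The following is a hedged sketch of that reading. It assumes every stream shares the same vector dimension and that segments are (start, end) frame indices; the function names are hypothetical and the paper's actual combination step may differ.

```python
import numpy as np

def expand_to_frames(unit_ids, segments, codebook, num_frames):
    """Look up each segment's centroid and repeat it over the frames it spans."""
    # Frames not covered by any segment stay at zero in this sketch.
    out = np.zeros((num_frames, codebook.shape[1]))
    for uid, (start, end) in zip(unit_ids, segments):
        out[start:end] = codebook[uid]
    return out

def combine_streams(streams, num_frames):
    """Mean-pool frame-aligned streams into one vector S_n per frame."""
    expanded = [expand_to_frames(ids, segs, cb, num_frames)
                for ids, segs, cb in streams]
    return np.mean(expanded, axis=0)

# Toy example: a frame-level stream and a coarser, phone-like stream over 4 frames.
frame_stream = ([0, 1, 1, 0], [(0, 1), (1, 2), (2, 3), (3, 4)],
                np.array([[0.0], [2.0]]))
phone_stream = ([1, 0], [(0, 2), (2, 4)], np.array([[1.0], [3.0]]))
S = combine_streams([frame_stream, phone_stream], num_frames=4)
```

Here `S[n]` plays the role of $S_n$: the coarser stream contributes the same vector to every frame inside its segment, so the segmentations overlap in time.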