Estimating the Completeness of Discrete Speech Units

Sung-Lin Yeh, Hao Tang

TL;DR

This work derives a lower bound for information completeness and estimates completeness on discretized HuBERT representations after residual vector quantization. It finds that speaker information is sufficiently present in HuBERT discrete units, and that phonetic information is sufficiently present in the residual.

Abstract

Representing speech with discrete units has been widely used in speech codecs and speech generation. However, there are several unverified claims about self-supervised discrete units, such as disentangling phonetic and speaker information with k-means, or assuming information loss after k-means. In this work, we take an information-theoretic perspective to answer how much information is present (information completeness) and how much information is accessible (information accessibility), before and after residual vector quantization. We show a lower bound for information completeness and estimate completeness on discretized HuBERT representations after residual vector quantization. We find that speaker information is sufficiently present in HuBERT discrete units, and that phonetic information is sufficiently present in the residual, showing that vector quantization does not achieve disentanglement. Our results offer a comprehensive assessment of the choice of discrete units, and suggest that much more of the information in the residual should be mined rather than discarded.

Paper Structure (20 sections, 4 equations, 5 figures, 3 tables)

Figures (5)

  • Figure 1: An illustration of information accessibility. $R_A$ and $R_B$ are two representations, and their probing errors differ depending on the model capacity of the probes. Under a linear probe, information in $R_B$ is more accessible than information in $R_A$, as shown by its lower error.
  • Figure 2: Frames of HuBERT representations assigned to two example k-means clusters, visualized with the first two principal components of PCA. Colors represent speaker identities.
  • Figure 3: The completeness and accessibility of representations at different rates (bits per frame). We vary the depth of RVQ from $L=1$ to $L=8$. Representations are quantized at a cost of 10 bits per codebook, corresponding to a codebook size $N=1024$. Codebooks are not fine-tuned.
  • Figure 4: An example of the reconstructed log Mels with HuBERT L4 representations and their discrete units. The distortion (MSE) decreases from left to right. Details over 20 Mel bands are better captured in (c) and (d). The ground truth is shown in Figure 5.
  • Figure 5: The ground truth utterance for Figure 4.
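
The residual vector quantization (RVQ) setup described in the Figure 3 caption can be illustrated with a minimal sketch. It assumes each codebook is learned by plain k-means on the residual left by the previous stage; the function names (`kmeans`, `train_rvq`, `rvq_encode`, `bits_per_frame`) and the toy k-means are hypothetical and are not taken from the paper's implementation. The rate calculation follows the caption: a codebook of size $N=1024$ costs $\log_2 1024 = 10$ bits, so depth $L$ costs $10L$ bits per frame.

```python
import numpy as np


def nearest(x, codebook):
    """Index of the closest codeword (squared Euclidean) for each row of x."""
    dists = ((x[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)


def kmeans(x, n_codes, n_iters=10, seed=0):
    """Toy k-means, used here only to obtain a codebook (an assumption)."""
    rng = np.random.default_rng(seed)
    codebook = x[rng.choice(len(x), size=n_codes, replace=False)].copy()
    for _ in range(n_iters):
        assign = nearest(x, codebook)
        for k in range(n_codes):
            members = x[assign == k]
            if len(members) > 0:  # leave empty clusters unchanged
                codebook[k] = members.mean(axis=0)
    return codebook


def train_rvq(x, depth, n_codes):
    """Fit one codebook per stage on the residual of the previous stage."""
    codebooks, residual = [], x.copy()
    for _ in range(depth):
        cb = kmeans(residual, n_codes)
        codebooks.append(cb)
        residual = residual - cb[nearest(residual, cb)]
    return codebooks


def rvq_encode(x, codebooks):
    """Return per-stage code indices and the final (unquantized) residual."""
    residual, codes = x.copy(), []
    for cb in codebooks:
        idx = nearest(residual, cb)
        codes.append(idx)
        residual = residual - cb[idx]
    return np.stack(codes, axis=1), residual


def bits_per_frame(depth, n_codes):
    """Rate of an RVQ code: depth codebooks, log2(n_codes) bits each."""
    return depth * np.log2(n_codes)
```

With these definitions, `bits_per_frame(1, 1024)` through `bits_per_frame(8, 1024)` span 10 to 80 bits per frame, matching the range of depths in Figure 3. The final residual returned by `rvq_encode` is the part that a discrete-unit pipeline would normally discard.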