Spoken Language Modeling with Duration-Penalized Self-Supervised Units

Nicol Visser, Herman Kamper

TL;DR

Spoken language models operate on discrete units derived from self-supervised speech representations, and the optimal balance between codebook size and unit duration is not well understood. The authors apply duration-penalized dynamic programming (DPDP) to produce coarser units from SSL features across a range of codebook sizes $\{100,200,500,1000\}$ and varying coarseness, and evaluate a unified SLM pipeline on phone, word, resynthesis, lexical, and syntactic tasks. They find that coarse units provide little benefit at the phone/word level, but improve resynthesis intelligibility and lexical/syntactic LM performance when the codebook is large enough. DPDP enables simple, efficient coarsening and, at scale, the DP-SLM system approaches the performance of state-of-the-art SLMs while being more memory-efficient. The results offer practical guidance on unit granularity for SLM design and highlight that task requirements dictate whether coarser units are advantageous.

Abstract

Spoken language models (SLMs) operate on acoustic units obtained by discretizing self-supervised speech representations. Although the characteristics of these units directly affect performance, the interaction between codebook size and unit coarseness (i.e., duration) remains unexplored. We investigate SLM performance as we vary codebook size and unit coarseness using the simple duration-penalized dynamic programming (DPDP) method. New analyses are performed across different linguistic levels. At the phone and word levels, coarseness provides little benefit, as long as the codebook size is chosen appropriately. However, when producing whole sentences in a resynthesis task, SLMs perform better with coarser units. In lexical and syntactic language modeling tasks, coarser units also give higher accuracies at lower bitrates. We therefore show that coarser units aren't always better, but that DPDP is a simple and efficient way to obtain coarser units for the tasks where they are beneficial.
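The abstract does not spell out the DPDP objective, so the following is only a minimal sketch of the general idea under stated assumptions: each frame pays its squared Euclidean distance to the code it is assigned, and every change of code pays a fixed penalty `lam`. The function names `dpdp_segment` and `collapse`, the parameter `lam`, and the toy data are hypothetical and not taken from the paper; the authors' actual cost and duration penalty may differ. With `lam = 0` the sketch reduces to ordinary nearest-code ($K$-means style) quantization, and larger values yield longer, coarser units.

```python
import numpy as np


def dpdp_segment(features, codebook, lam):
    """Assign one code per frame, trading quantization error against
    the number of segments (assumed form of the duration penalty).

    features: (T, D) array of SSL frames.
    codebook: (K, D) array of cluster centroids.
    lam:      hypothetical penalty paid each time the code changes.
    Returns an integer array of T per-frame code indices.
    """
    # Squared Euclidean distance from every frame to every code: (T, K).
    dists = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    T, K = dists.shape
    cost = np.empty((T, K))
    back = np.zeros((T, K), dtype=int)
    cost[0] = dists[0]
    for t in range(1, T):
        best_prev = int(np.argmin(cost[t - 1]))
        switch = cost[t - 1, best_prev] + lam  # start a new segment
        stay = cost[t - 1]                     # extend the current segment
        use_switch = switch < stay
        cost[t] = dists[t] + np.where(use_switch, switch, stay)
        back[t] = np.where(use_switch, best_prev, np.arange(K))
    # Backtrack from the cheapest final code.
    codes = np.empty(T, dtype=int)
    codes[-1] = int(np.argmin(cost[-1]))
    for t in range(T - 1, 0, -1):
        codes[t - 1] = back[t, codes[t]]
    return codes


def collapse(codes):
    """Merge runs of identical codes into (code, duration) pairs."""
    units = []
    for c in codes:
        if units and units[-1][0] == c:
            units[-1][1] += 1
        else:
            units.append([int(c), 1])
    return [tuple(u) for u in units]


# Toy example: 1-D "features" and a two-entry codebook (made-up data).
feats = np.array([[0.0], [0.1], [0.6], [0.1], [0.0]])
cb = np.array([[0.0], [1.0]])
fine = collapse(dpdp_segment(feats, cb, lam=0.0))    # frame-wise nearest code
coarse = collapse(dpdp_segment(feats, cb, lam=1.0))  # penalty merges the blip
```

In the toy example the single outlying frame gets its own unit when `lam = 0` and is absorbed into the surrounding unit once switching costs more than the quantization error it saves, which is the kind of coarsening the abstract describes.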

Paper Structure

This paper contains 11 sections, 1 equation, 5 figures, 1 table.

Figures (5)

  • Figure 1: Coarser and coarser speech units obtained with DPDP. The units are labeled with the most likely phone from the forced alignments. The codebook size is $K=500$.
  • Figure 2: The spoken language model used in our experiments. The S2U encoder converts speech to discrete units, the unit language model is trained to predict the next unit, and the U2S decoder generates speech from the units.
  • Figure 3: Phone and word discrimination scores of the DPDP units (colored curves) compared to $K$-means units (black markers). (a) ABX error rates. (b) Same-different average precision (AP). Both tasks are evaluated on LibriSpeech dev-clean.
  • Figure 4: WERs when resynthesizing the dev-clean subset of LibriSpeech using increasingly coarser DPDP units (colored curves) compared to the $K$-means units (black markers).
  • Figure 5: Language modeling performance of the DPDP units (colored curves) compared to the $K$-means units (black markers). (a) sWUGGY lexical accuracy. (b) sBLIMP syntactic accuracy. Both tasks are evaluated on sLM21 dev (Dunbar et al., 2021).