
LAST: Language Model Aware Speech Tokenization

Arnon Turetzky, Yossi Adi

TL;DR

LAST addresses the misalignment between speech tokenization and downstream language modeling by coupling a frozen text LM with a speech tokenizer. The framework uses a frozen HuBERT-based encoder, a learnable adaptor-quantizer, and vector quantization to produce discrete tokens that are optimized via a next-token objective $\mathcal{L}_{LM}$ together with a reconstruction loss, while keeping the LM fixed. Empirically, LAST improves zero-resource sequence modeling and ASR WER over k-means baselines and maintains text-model capabilities; it also demonstrates that a single pre-trained LM can process both speech and text inputs. However, ABX results favor conventional acoustic tokenization for phoneme-level discrimination, and LAST incurs higher compute than traditional tokenizers, motivating future work on unit-to-speech synthesis and broader multilingual evaluation.
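
The two ingredients named above, vector quantization and a next-token objective combined with a reconstruction loss, can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the function names, the squared-Euclidean nearest-neighbor rule, and the `rec_weight` parameter are assumptions made for illustration, and the summary does not specify what is reconstructed or how the two terms are weighted.

```python
import math


def quantize(frames, codebook):
    """Map each feature frame to the index of its nearest codebook vector."""
    tokens = []
    for frame in frames:
        # Squared Euclidean distance from this frame to every codebook entry.
        dists = [
            sum((f - c) ** 2 for f, c in zip(frame, code))
            for code in codebook
        ]
        tokens.append(dists.index(min(dists)))
    return tokens


def next_token_nll(target_probs):
    """Mean negative log-likelihood of the observed next tokens.

    target_probs[i] is the probability a (frozen) LM assigns to the token
    that actually follows position i.
    """
    return -sum(math.log(p) for p in target_probs) / len(target_probs)


def mse(xs, ys):
    """Mean squared error between two equal-length lists of frames."""
    total, count = 0.0, 0
    for x, y in zip(xs, ys):
        for a, b in zip(x, y):
            total += (a - b) ** 2
            count += 1
    return total / count


def combined_loss(lm_loss, rec_loss, rec_weight=1.0):
    """Hypothetical weighted sum of the LM term and the reconstruction term."""
    return lm_loss + rec_weight * rec_loss
```

In an actual training setup, gradients from the LM term would flow back only into the learnable adaptor-quantizer, since the summary states that both the speech encoder and the LM stay frozen. The sketch covers only the forward computations.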

Abstract

Speech tokenization serves as the foundation of speech language models (LMs), enabling them to perform various tasks such as spoken language modeling, text-to-speech, and speech-to-text. Most speech tokenizers are trained independently of the LM training process, relying on separate acoustic models and quantization methods. Such an approach may create a mismatch between the tokenization process and its later use. In this study, we propose a novel approach to training a speech tokenizer by leveraging objectives from pre-trained textual LMs. We advocate integrating this objective into the process of learning discrete speech representations. Our aim is to transform features from a pre-trained speech model into a new feature space that enables better clustering for speech LMs. We empirically investigate the impact of various model design choices, including speech vocabulary size and text LM size. Our results demonstrate that the proposed tokenization method outperforms the evaluated baselines on both spoken language modeling and speech-to-text. More importantly, unlike prior work, the proposed method allows a single pre-trained LM to process both speech and text inputs, setting it apart from conventional tokenization approaches.

Paper Structure

This paper contains 15 sections, 2 equations, 3 figures, 5 tables.

Figures (3)

  • Figure 1: A visual description of LAST. We propose to leverage a pre-trained text LM to construct a speech tokenizer. LAST receives gradients from the LM to guide the tokenization process toward better sequence modeling. Pre-trained, frozen modules are shown in blue and learned modules in green.
  • Figure 2: The common SpeechLM pipeline. First, a discrete representation is extracted from the raw waveform using a speech encoder and a quantization module (together often called a speech tokenizer). This representation is later used to train a uLM. In this study, we focus on the speech tokenization part.
  • Figure 3: Units visualization. Each bounded area represents a single unit out of 200 and is colored by the unit’s phoneme family.