TASTE: Text-Aligned Speech Tokenization and Embedding for Spoken Language Modeling

Liang-Hsuan Tseng, Yi-Chang Chen, Kuan-Yi Lee, Da-Shan Shiu, Hung-yi Lee

TL;DR

TASTE introduces text-aligned speech tokenization and embedding to bridge the modality gap in joint speech-text modeling. Using a cross-attention-based aggregator trained with a speech reconstruction objective, TASTE yields speech tokens and embeddings that are aligned with the text transcription, at a markedly reduced bitrate (about $150$ bps), while preserving paralinguistic cues. This enables straightforward joint spoken language modeling (TASLM) by fine-tuning a text LLM with LoRA, with performance comparable to prior work on SALMON and StoryCloze and stronger speech continuation than other pre-trained SLMs. It also enables text-aligned speech editing. The results suggest that end-to-end, reconstruction-based tokenization is a practical, parameter-efficient route for turning text LLMs into spoken agents.

Abstract

Recent efforts target spoken language models (SLMs) that not only listen but also speak, for more natural human-LLM interaction. Joint speech-text modeling is a promising direction to achieve this. However, the effectiveness of recent speech tokens for joint modeling remains underexplored. To address this, we introduce Text-Aligned Speech Tokenization and Embedding (TASTE), a method that directly addresses the modality gap by aligning speech tokens with the corresponding text transcription during the tokenization stage. We achieve this through an attention-based aggregation mechanism, with speech reconstruction as the training objective. We conduct extensive experiments and show that TASTE can preserve essential paralinguistic information while dramatically reducing the token sequence length. With TASTE, we perform straightforward joint spoken language modeling by applying Low-Rank Adaptation to the pre-trained text LLM. Experimental results show that TASTE-based SLMs perform comparably to previous work on SALMON and StoryCloze, while significantly outperforming other pre-trained SLMs on speech continuation across subjective and objective evaluations. To our knowledge, TASTE is the first end-to-end approach that utilizes a reconstruction objective to automatically learn a text-aligned speech tokenization and embedding suitable for spoken language modeling. Our demo, code, and model are available at https://mtkresearch.github.io/TASTE-SpokenLM.github.io.
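
The attention-based aggregation mentioned in the abstract can be illustrated with a minimal sketch. It assumes that text-token representations act as queries and speech-frame representations act as keys and values, so the output has exactly one vector per text token regardless of how many speech frames there are. The function name `cross_attention_aggregate`, the toy vectors, and the use of plain scaled dot-product attention are illustrative assumptions. This section does not specify the actual encoder, projection layers, or quantization, so this is not the authors' implementation.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention_aggregate(text_queries, speech_keys, speech_values):
    """Aggregate T speech frames into N text-aligned vectors (hypothetical helper).

    text_queries:  N vectors of size d (assumed: one per text token)
    speech_keys:   T vectors of size d (assumed: one per speech frame)
    speech_values: T vectors of any fixed size
    Returns N vectors, each a softmax-weighted average of the speech values.
    """
    scale = 1.0 / math.sqrt(len(text_queries[0]))
    value_dim = len(speech_values[0])
    aggregated = []
    for query in text_queries:
        # Scaled dot-product score between this text token and every frame.
        scores = [scale * sum(q * k for q, k in zip(query, key))
                  for key in speech_keys]
        weights = softmax(scores)
        # Weighted average of the frame values for this text token.
        aggregated.append([
            sum(w * value[j] for w, value in zip(weights, speech_values))
            for j in range(value_dim)
        ])
    return aggregated


# Toy data (assumption): 2 text tokens, 6 speech frames, 2-dimensional features.
text_queries = [[1.0, 0.0], [0.0, 1.0]]
speech_frames = [[4.0, 0.0], [3.0, 0.0], [0.0, 3.0],
                 [0.0, 4.0], [0.0, 5.0], [2.0, 0.0]]

aligned = cross_attention_aggregate(text_queries, speech_frames, speech_frames)
print(len(speech_frames), "frames ->", len(aligned), "text-aligned vectors")
```

Because the output length follows the number of queries rather than the number of frames, the speech sequence ends up the same length as the text sequence. This is the property that removes the length mismatch in joint modeling.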

Paper Structure

This paper contains 41 sections, 11 equations, 3 figures, 7 tables, 1 algorithm.

Figures (3)

  • Figure 1: The concept overview. Conventional methods extract speech tokens solely from speech, inducing a length-mismatch problem when conducting joint speech-text modeling. By taking both modalities as input, we generate a speech tokenization that is aligned with text, facilitating straightforward and effective joint modeling.
  • Figure 2: The overall framework of our text-aligned speech tokenization and embedding. The left side illustrates the process of obtaining the TASTE tokenization $\hat{\bm{z}}$, detailed in Section \ref{subsubsec:taste_speech_tokenizer}, while the right side demonstrates how we reconstruct the speech with TASTE (Section \ref{subsubsec:taste_speech_decoder}). The training objective for our speech reconstruction is discussed in Section \ref{subsubsec:training_objective}.
  • Figure 3: An illustration of TASTE for text-aligned speech editing. The left side shows the editing process: we first extract the TASTE tokens, swap some of the tokens, and then decode the edited TASTE tokens into edited speech. The right side shows an example visualization. Only the durations of the words with exchanged TASTE tokens show a significant difference.
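
The swap step described for Figure 3 can be sketched as follows. The sketch assumes two utterances that share the same transcript, so position i in each token sequence refers to the same text token. The function `swap_aligned_tokens`, the tuple-of-integers token representation, and all values are hypothetical. Extracting the tokens and decoding them back into speech are omitted because this section does not describe those interfaces.

```python
def swap_aligned_tokens(tokens_a, tokens_b, positions):
    """Return a copy of tokens_a with the tokens at `positions` taken from tokens_b.

    Both sequences are assumed to be text-aligned to the same transcript,
    so they must have the same length (one entry per text token).
    """
    if len(tokens_a) != len(tokens_b):
        raise ValueError("text-aligned sequences must have the same length")
    edited = list(tokens_a)  # copy, so the source sequence is left untouched
    for index in positions:
        edited[index] = tokens_b[index]
    return edited


# Hypothetical example: the same three-word transcript spoken in two ways.
words = ["the", "quick", "fox"]
tokens_slow = [(11, 4), (27, 9), (3, 15)]   # made-up code indices
tokens_fast = [(12, 6), (30, 1), (8, 2)]    # made-up code indices

# Exchange only the token aligned with "quick" (position 1).
edited = swap_aligned_tokens(tokens_slow, tokens_fast, positions=[1])
print(list(zip(words, edited)))
```

Because each token is tied to one text position, the edit stays local: only the selected words receive the other utterance's tokens.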