TASTE-Streaming: Towards Streamable Text-Aligned Speech Tokenization and Embedding for Spoken Language Modeling

Liang-Hsuan Tseng, Hung-yi Lee

Abstract

Text-speech joint spoken language modeling (SLM) aims at natural and intelligent speech-based interactions, but developing such a system may suffer from modality mismatch: speech unit sequences are much longer than text tokens. Prior work reduces this gap with text-aligned tokenization and embedding (TASTE), producing speech tokens that align in length with their textual counterparts. However, the dependence on an external ASR system and the use of a non-causal decoder limit streaming use. To address this limitation, we propose TASTE-S, a streamable extension of TASTE suitable for real-time usage. TASTE-S integrates a CTC-based ASR module into the encoder for instant dual-modality encoding. We also redesign the unit decoder to enable on-the-fly decoding. With joint training, we show that TASTE-S matches TASTE's performance while significantly reducing latency. Further investigations reveal that TASTE-S remains robust to transcriptions and enables long-form encoding and decoding.

Paper Structure (15 sections, 6 equations, 3 figures, 3 tables)

Figures (3)

  • Figure 1: Comparison between conventional, TASTE, and TASTE-S speech tokenizers. (a) Conventional pipelines suffer from modality mismatch. (b) TASTE aligns speech tokens to text tokens but is non-streamable since an external offline ASR is required. (c) Our TASTE-S achieves aligned and streamable tokenization via a built-in ASR and other designs for streaming.
  • Figure 2: The framework overview of our TASTE-S tokenizer. The left shows the encoder with CTC integrated; the right illustrates the streaming pattern for a streamable decoder.
  • Figure 3: Visualization results. Left: reconstruction remains nearly unchanged when using the ground-truth transcript versus the CTC transcript (blue/red mark their mismatched parts). Right: the aggregator shows clear text–speech aligned cross-attention, consistent with TASTE.