Long-Form Speech Generation with Spoken Language Models

Se Jin Park, Julian Salazar, Aren Jansen, Keisuke Kinoshita, Yong Man Ro, RJ Skerry-Ryan

TL;DR

The paper tackles long-form, textless speech generation by introducing SpeechSSM, a decoder-only spoken language model built on the Griffin hybrid state-space architecture. Because its state has a fixed size and its cost grows linearly with sequence length, it can generate speech of unbounded length. The pipeline has two stages: a language model over semantic tokens (USM-v2), followed by acoustic synthesis (SoundStorm producing SoundStream codec tokens), with windowed tokenization and decoding to handle long contexts. To evaluate long-form speech, the authors propose the LibriSpeech-Long benchmark and new evaluation methods, including embedding-based semantic metrics and LLM-based side-by-side judgments. They report that SpeechSSM outperforms Transformer baselines on multi-minute generations and extrapolates robustly to lengths beyond those seen in training, and that SpeechSSM-X extends the approach to extemporaneous speech. The authors also report high throughput and favorable real-time factors, and they release the dataset and samples to support further research in audio-native long-form generation.

Abstract

We consider the generative modeling of speech over multiple minutes, a requirement for long-form multimedia generation and audio-native voice assistants. However, textless spoken language models struggle to generate plausible speech past tens of seconds, due to high temporal resolution of speech tokens causing loss of coherence, architectural issues with long-sequence training or extrapolation, and memory costs at inference time. From these considerations we derive SpeechSSM, the first speech language model family to learn from and sample long-form spoken audio (e.g., 16 minutes of read or extemporaneous speech) in a single decoding session without text intermediates. SpeechSSMs leverage recent advances in linear-time sequence modeling to greatly surpass current Transformer spoken LMs in coherence and efficiency on multi-minute generations while still matching them at the utterance level. As we found current spoken language evaluations uninformative, especially in this new long-form setting, we also introduce: LibriSpeech-Long, a benchmark for long-form speech evaluation; new embedding-based and LLM-judged metrics; and quality measurements over length and time. Speech samples, the LibriSpeech-Long dataset, and any future code or model releases can be found at https://google.github.io/tacotron/publications/speechssm/.

Paper Structure

This paper contains 27 sections, 10 figures, 6 tables.

Figures (10)

  • Figure 1: Maximum sequence lengths considered by various spoken LMs. Italicized models used text intermediates at generation time. Our models can generate indefinitely due to their constant memory footprint, but we cap our evaluations to 16 minutes.
  • Figure 2: Automated transcripts of 4-minute speech continuations generated by SpeechSSM-2B (ours) and a Spirit LM Expressive (7B) model (Nguyen et al., 2024) under slide-and-prompt generation (see the long-form experiments section), extending a 10-second audio-only prompt from our proposed LibriSpeech-Long (test-clean). Aspects like recurring proper nouns show SpeechSSM's relative semantic consistency over time.
  • Figure 3: Overview of SpeechSSM. Left: A causally-masked hybrid state-space model (Griffin) is trained with an LM objective on semantic tokens (USM-v2) encoded via overlapping fixed-size windows. Right: A non-autoregressive synthesizer (SoundStorm) converts overlapping windows of semantic tokens to the acoustic tokens of a neural codec (SoundStream) in a speaker-conditioned manner.
  • Figure 4: Depiction of how input and output windowing work, shown here with 5-token window widths and 2-token overlaps.
  • Figure 5: Semantic similarity between a 10-second prompt and the 100-word segment starting at word 100$L$ in the 16-minute completion, as measured by cosine similarity of Gecko embeddings (SC-$L$).
  • ...and 5 more figures
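
The captions for Figures 4 and 5 describe two mechanical steps that a short sketch may make easier to picture: splitting a token sequence into fixed-width windows that overlap, and scoring a 100-word segment of a completion against the prompt by cosine similarity. The Python below is a minimal illustration under stated assumptions, not the authors' implementation. The function names are hypothetical, the rule for merging overlapping windows (drop the leading overlap of every later window) is an assumption, and plain lists of floats stand in for Gecko embeddings, which are not computed here.

```python
import math
from typing import List, Sequence


def overlapping_windows(tokens: Sequence[int], width: int = 5,
                        overlap: int = 2) -> List[List[int]]:
    """Split tokens into windows of `width` whose neighbours share `overlap` tokens.

    The defaults mirror the 5-token windows and 2-token overlaps of Figure 4.
    The final window may be shorter than `width`.
    """
    if not 0 <= overlap < width:
        raise ValueError("overlap must satisfy 0 <= overlap < width")
    stride = width - overlap
    windows = []
    start = 0
    while True:
        windows.append(list(tokens[start:start + width]))
        if start + width >= len(tokens):
            break
        start += stride
    return windows


def merge_windows(windows: List[List[int]], overlap: int = 2) -> List[int]:
    """Rejoin windows into one sequence.

    Assumed rule (not taken from the paper): keep the first window whole and
    drop the first `overlap` tokens of every later window.
    """
    merged = list(windows[0]) if windows else []
    for window in windows[1:]:
        merged.extend(window[overlap:])
    return merged


def segment_words(words: Sequence[str], L: int, size: int = 100) -> List[str]:
    """Return the `size`-word segment starting at word `size * L` (as in SC-L)."""
    return list(words[size * L:size * (L + 1)])


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


if __name__ == "__main__":
    tokens = list(range(11))
    windows = overlapping_windows(tokens)
    print(windows)
    print(merge_windows(windows) == tokens)
```

With eleven tokens and the default settings, the stride is 3, so the windows start at positions 0, 3, and 6, and merging them under the assumed rule recovers the original sequence. To compute an SC-$L$-style score, one would embed the prompt transcript and the output of `segment_words` with a text-embedding model and pass the two vectors to `cosine_similarity`.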