Table of Contents
Fetching ...

Generative Pre-trained Speech Language Model with Efficient Hierarchical Transformer

Yongxin Zhu, Dan Su, Liqiang He, Linli Xu, Dong Yu

TL;DR

GPST quantizes audio waveforms into two distinct types of discrete speech representations and integrates them within a hierarchical transformer architecture, allowing for a unified one-stage generation process and enhancing Hi-Res audio generation capabilities.

Abstract

While recent advancements in speech language models have achieved significant progress, they face remarkable challenges in modeling the long acoustic sequences of neural audio codecs. In this paper, we introduce \textbf{G}enerative \textbf{P}re-trained \textbf{S}peech \textbf{T}ransformer (GPST), a hierarchical transformer designed for efficient speech language modeling. GPST quantizes audio waveforms into two distinct types of discrete speech representations and integrates them within a hierarchical transformer architecture, allowing for a unified one-stage generation process and enhancing Hi-Res audio generation capabilities. By training on large corpora of speeches in an end-to-end unsupervised manner, GPST can generate syntactically consistent speech with diverse speaker identities. Given a brief 3-second prompt, GPST can produce natural and coherent personalized speech, demonstrating in-context learning abilities. Moreover, our approach can be easily extended to spoken cross-lingual speech generation by incorporating multi-lingual semantic tokens and universal acoustic tokens. Experimental results indicate that GPST significantly outperforms the existing speech language models in terms of word error rate, speech quality, and speaker similarity. The code is available at \url{https://github.com/youngsheen/GPST}.

Generative Pre-trained Speech Language Model with Efficient Hierarchical Transformer

TL;DR

GPST quantizes audio waveforms into two distinct types of discrete speech representations and integrates them within a hierarchical transformer architecture, allowing for a unified one-stage generation process and enhancing Hi-Res audio generation capabilities.

Abstract

While recent advancements in speech language models have achieved significant progress, they face remarkable challenges in modeling the long acoustic sequences of neural audio codecs. In this paper, we introduce \textbf{G}enerative \textbf{P}re-trained \textbf{S}peech \textbf{T}ransformer (GPST), a hierarchical transformer designed for efficient speech language modeling. GPST quantizes audio waveforms into two distinct types of discrete speech representations and integrates them within a hierarchical transformer architecture, allowing for a unified one-stage generation process and enhancing Hi-Res audio generation capabilities. By training on large corpora of speeches in an end-to-end unsupervised manner, GPST can generate syntactically consistent speech with diverse speaker identities. Given a brief 3-second prompt, GPST can produce natural and coherent personalized speech, demonstrating in-context learning abilities. Moreover, our approach can be easily extended to spoken cross-lingual speech generation by incorporating multi-lingual semantic tokens and universal acoustic tokens. Experimental results indicate that GPST significantly outperforms the existing speech language models in terms of word error rate, speech quality, and speaker similarity. The code is available at \url{https://github.com/youngsheen/GPST}.
Paper Structure (23 sections, 7 equations, 3 figures, 5 tables)

This paper contains 23 sections, 7 equations, 3 figures, 5 tables.

Figures (3)

  • Figure 1: The comparison of frameworks for generative speech pre-training. (a) AudioLM is a three-stage model. (b) VALL-E is a two-stage model. (c) GPST is a one-stage model.
  • Figure 2: An overview of the framework. The framework is composed of three components: (1) The semantic token extractor with a speech SSL model and K-means for speech discretization. (2) The acoustic token extractor with the neural codec model for speech discretization. (3) The proposed GPST model, which is composed of a global transformer and a local transformer.
  • Figure 3: The comparison of mel-spectrograms generated by GPST with (a) 6kbps and (b) 12kbps (Hi-Res). The harmonic energy in the high-frequency of 12kbps is richer.