SpeechAlign: Aligning Speech Generation to Human Preferences

Dong Zhang; Zhaowei Li; Shimin Li; Xin Zhang; Pengyu Wang; Yaqian Zhou; Xipeng Qiu

TL;DR

This work identifies a distribution gap in neural codec language models that arises when the NAR stage encounters synthetic AR tokens during inference, degrading speech quality. It introduces SpeechAlign, an iterative self-improvement framework that builds a preference codec dataset by contrasting golden versus synthetic codec tokens and applies multiple preference optimization strategies to align outputs with human preferences without extra labeled data. Empirical results on LibriSpeech and VCTK show consistent improvements in both content accuracy (WER) and timbre consistency (SIM), with benefits extending to small AR models and unseen speakers, and improvements accumulate across iterations. By bridging the codec-token distribution gap through preference-guided learning, SpeechAlign enables scalable, continuous self-improvement of speech generation systems.

Abstract

Speech language models have significantly advanced in generating realistic speech, with neural codec language models standing out. However, the integration of human feedback to align speech outputs to human preferences is often neglected. This paper addresses this gap by first analyzing the distribution gap in codec language models, highlighting how it leads to discrepancies between the training and inference phases, which negatively affects performance. Then we explore leveraging learning from human feedback to bridge the distribution gap. We introduce SpeechAlign, an iterative self-improvement strategy that aligns speech language models to human preferences. SpeechAlign involves constructing a preference codec dataset contrasting golden codec tokens against synthetic tokens, followed by preference optimization to improve the codec language model. This cycle of improvement is carried out iteratively to steadily convert weak models to strong ones. Through both subjective and objective evaluations, we show that SpeechAlign can bridge the distribution gap and facilitate continuous self-improvement of the speech language model. Moreover, SpeechAlign exhibits robust generalization capabilities and works for smaller models. Code and models will be available at https://github.com/0nutation/SpeechGPT.
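
The preference optimization step described above can be illustrated with a minimal sketch of the standard Direct Preference Optimization (DPO) loss for a single preference pair, where the golden codec tokens are treated as the preferred sequence and the synthetic codec tokens as the rejected one. This is not the authors' implementation: the function name `dpo_loss`, the value of `beta`, and the log-probabilities in the usage example are hypothetical and chosen only for illustration.

```python
import math


def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO loss for one preference pair.

    Each argument is a sequence log-probability (sum of per-token log-probs):
      policy_* come from the model being optimized,
      ref_*    come from the frozen reference model.
    "chosen" is assumed to be the golden codec token sequence and
    "rejected" the synthetic one. beta is an assumed hyperparameter value.
    """
    # Difference of implicit rewards between the chosen and rejected sequences.
    margin = beta * ((policy_chosen - ref_chosen) - (policy_rejected - ref_rejected))
    # -log(sigmoid(margin)), written in a numerically stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Illustrative (made-up) log-probabilities.
# When the policy equals the reference, the loss is log(2).
loss_at_start = dpo_loss(-10.0, -12.0, -10.0, -12.0)
# When the policy favors the golden tokens more than the reference does,
# the loss drops below log(2).
loss_after_update = dpo_loss(-8.0, -14.0, -10.0, -12.0)
```

In an iterative setup like the one the abstract describes, the model optimized in one round would presumably generate the synthetic tokens for the next round's preference pairs; that loop is not shown here.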

Paper Structure (20 sections, 4 equations, 4 figures, 3 tables, 1 algorithm)

Introduction
Preliminary Analysis on Distribution Gap
Background
Visualization of Distribution Gap
Distribution Gap Degrades Performance
SpeechAlign
Preference Data Collection
Preference Optimization
Iterative Self-Improvement
Experiments
Setups
Evaluation and Metrics
Main Results
Analysis
Ablation Studies
...and 5 more sections

Figures (4)

Figure 1: Qualitative side-by-side comparison results of preference optimized models versus the baseline SFT model on zero-shot text-to-speech performance. SpeechAlign-RLHF-PPO denotes models optimized by RLHF using the PPO algorithm. SpeechAlign-DPO-Iter1 denotes models optimized by the Direct Preference Optimization method at the first iteration. SpeechAlign-DPO-Iter2 and SpeechAlign-DPO-Iter3 denote the models optimized at the second and third iterations, respectively. SpeechAlign-CoH represents models optimized by the Chain-of-Hindsight strategy. SpeechAlign-BoN refers to the baseline SFT model employing the Best-of-N sampling method. SpeechAlign-BoN, SpeechAlign-RLHF-PPO and the SpeechAlign-DPO series models significantly outperform the baseline model on both the LibriSpeech and VCTK datasets.
Figure 2: t-SNE visualization of representations of different AR tokens. Left: Golden AR tokens and synthetic AR tokens. Right: Golden AR tokens and aligned synthetic AR tokens.
Figure 3: AR LM refers to autoregressive models and NAR LM refers to non-autoregressive models. Left: Illustration of inference process of codec language models. Right: Illustration of SpeechAlign method.
Figure 4: Left: Performance of SpeechAlign across different preference data sizes. Right: Performance of SpeechAlign on small models.
