SpeechAlign: Aligning Speech Generation to Human Preferences
Dong Zhang, Zhaowei Li, Shimin Li, Xin Zhang, Pengyu Wang, Yaqian Zhou, Xipeng Qiu
TL;DR
This work identifies a distribution gap in neural codec language models that arises when the non-autoregressive (NAR) stage encounters synthetic autoregressive (AR) tokens during inference, degrading speech quality. It introduces SpeechAlign, an iterative self-improvement framework that builds a preference codec dataset by contrasting golden codec tokens with synthetic ones and applies multiple preference optimization strategies to align outputs with human preferences without extra labeled data. On LibriSpeech and VCTK, SpeechAlign consistently improves both content accuracy (WER) and timbre consistency (SIM); the benefits extend to small AR models and unseen speakers, and the gains accumulate across iterations. By bridging the codec-token distribution gap through preference-guided learning, SpeechAlign enables scalable, continuous self-improvement of speech generation systems.
Abstract
Speech language models have advanced significantly in generating realistic speech, with neural codec language models standing out. However, the integration of human feedback to align speech outputs with human preferences is often neglected. This paper addresses this gap by first analyzing the distribution gap in codec language models, highlighting how it leads to discrepancies between the training and inference phases that negatively affect performance. We then explore learning from human feedback to bridge the distribution gap. We introduce SpeechAlign, an iterative self-improvement strategy that aligns speech language models with human preferences. SpeechAlign involves constructing a preference codec dataset that contrasts golden codec tokens with synthetic tokens, followed by preference optimization to improve the codec language model. This cycle of improvement is carried out iteratively to steadily convert weak models into strong ones. Through both subjective and objective evaluations, we show that SpeechAlign can bridge the distribution gap and facilitate continuous self-improvement of the speech language model. Moreover, SpeechAlign exhibits robust generalization capabilities and works for smaller models. Code and models will be available at https://github.com/0nutation/SpeechGPT.
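
The two steps named in the abstract, building preference pairs and then optimizing on them, can be illustrated with a minimal Python sketch. It is not the authors' implementation. The names `PreferencePair`, `build_preference_pairs`, `generate`, and `dpo_loss` are hypothetical, and Direct Preference Optimization (DPO) is used only as one common example of a preference objective; the abstract does not say which objectives are applied. The sketch assumes golden codec tokens are treated as the preferred sample and model-generated tokens as the rejected one.

```python
import math
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class PreferencePair:
    prompt: str
    chosen: List[int]    # golden codec tokens (assumed to be the preferred sample)
    rejected: List[int]  # synthetic tokens produced by the current model


def build_preference_pairs(
    prompts: Sequence[str],
    golden_tokens: Sequence[Sequence[int]],
    generate: Callable[[str], Sequence[int]],  # hypothetical sampler for the current model
) -> List[PreferencePair]:
    """Pair each prompt's golden tokens with tokens sampled from the current model."""
    return [
        PreferencePair(prompt, list(golden), list(generate(prompt)))
        for prompt, golden in zip(prompts, golden_tokens)
    ]


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,
) -> float:
    """DPO loss for one pair, given sequence log-probabilities under the
    trainable policy and a frozen reference model."""
    margin = beta * (
        (policy_chosen_logp - ref_chosen_logp)
        - (policy_rejected_logp - ref_rejected_logp)
    )
    # -log(sigmoid(margin)), written to avoid overflow for large |margin|.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

Under these assumptions, one iteration would sample synthetic tokens with the current model, build the pairs, and minimize the loss over them; the updated model would then serve as the sampler for the next iteration.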
