Align-SLM: Textless Spoken Language Models with Reinforcement Learning from AI Feedback

Guan-Ting Lin, Prashanth Gurunath Shivakumar, Aditya Gourav, Yile Gu, Ankur Gandhe, Hung-yi Lee, Ivan Bulyko

TL;DR

This paper addresses the semantic gap between textless Spoken Language Models (SLMs) and text-based LLMs by introducing Align-SLM, a framework that uses Reinforcement Learning from AI Feedback (RLAIF)-inspired preference optimization to improve long-range semantics in speech. It generates multiple continuations from a prompt, creates automatic preference data using an LLM-based semantic evaluator (Mistral) and perplexity- or BLEU-based criteria, and trains with Direct Preference Optimization (DPO) coupled with curriculum learning. Across ZeroSpeech 2021 benchmarks and spoken StoryCloze tasks, Align-SLM achieves state-of-the-art results for textless SLMs, with notable gains in semantic coherence and continuation relevance, while maintaining audio quality. The approach demonstrates the importance of preference-based semantic alignment for end-to-end speech-to-speech models and points to scalable extensions, including multilingual support and broader paralinguistic aspects.
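For reference, the DPO objective in its standard form is sketched below. This is the general formulation from the DPO literature, not necessarily the notation used in the paper. The symbols are assumptions made for illustration: $x$ is the speech prompt, $y_w$ and $y_l$ are the preferred and rejected continuations, $\pi_\theta$ is the SLM being trained, $\pi_{\mathrm{ref}}$ is the frozen pre-trained SLM, $\sigma$ is the logistic function, and $\beta$ is a scaling hyperparameter.

```latex
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}})
  = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}
    \left[ \log \sigma\!\left(
      \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
      - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
    \right) \right]
```

Minimizing this loss raises the likelihood of the preferred continuation relative to the rejected one, measured against the reference model, so no separate reward model needs to be trained.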

Abstract

While textless Spoken Language Models (SLMs) have shown potential in end-to-end speech-to-speech modeling, they still lag behind text-based Large Language Models (LLMs) in terms of semantic coherence and relevance. This work introduces the Align-SLM framework, which leverages preference optimization inspired by Reinforcement Learning from AI Feedback (RLAIF) to enhance the semantic understanding of SLMs. Our approach generates multiple speech continuations from a given prompt and uses semantic metrics to create preference data for Direct Preference Optimization (DPO). We evaluate the framework using ZeroSpeech 2021 benchmarks for lexical and syntactic modeling, the spoken version of the StoryCloze dataset for semantic coherence, and other speech generation metrics, including the GPT-4o score and human evaluation. Experimental results show that our method achieves state-of-the-art performance for SLMs on most benchmarks, highlighting the importance of preference optimization to improve the semantics of SLMs.
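The step of turning scored continuations into preference data can be illustrated with a minimal Python sketch. It assumes each continuation already has a transcript and a semantic score. The `ScoredContinuation` container, the `build_preference_pair` function, and the `min_margin` value are hypothetical and chosen for illustration; the selection rule actually used in the paper is not described in this summary and may differ.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class ScoredContinuation:
    """One sampled continuation with its evaluator score (hypothetical container)."""
    text: str     # transcript of the spoken continuation
    score: float  # semantic score from the evaluator, higher is better


def build_preference_pair(
    candidates: List[ScoredContinuation],
    min_margin: float = 1.0,  # assumed value, not taken from the paper
) -> Optional[Tuple[ScoredContinuation, ScoredContinuation]]:
    """Return (chosen, rejected), or None if the scores are too close to be informative."""
    if len(candidates) < 2:
        return None
    chosen = max(candidates, key=lambda c: c.score)
    rejected = min(candidates, key=lambda c: c.score)
    # Skip prompts whose best and worst continuations are nearly tied.
    if chosen.score - rejected.score < min_margin:
        return None
    return chosen, rejected
```

Each returned pair, together with its prompt, would form one training example for DPO. Dropping near-tied pairs is one simple way to keep noisy preferences out of the training set.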

Paper Structure

This paper contains 38 sections, 3 equations, 6 figures, 6 tables.

Figures (6)

  • Figure 1: The illustration of the Align-SLM framework. Firstly, tokenize the speech prompt into speech tokens using a speech tokenizer. Then, generate and sample multiple speech continuations from the pre-trained SLM. To create the preference data pairs, the framework uses a unit-based vocoder to synthesize the speech continuations back to waveforms, an ASR model to transcribe the waveforms to text, and an LLM evaluator to rate the quality of the semantics. These preference data pairs are then used for direct preference optimization to train the LoRA adapter in the SLM. This alignment process can be coupled with curriculum learning to further improve performance.
  • Figure 2: The 7B model undergoes up to three iterations of curriculum learning with Align-SLM using LibriSpeech data.
  • Figure 3: The normalized histogram of the Mistral score distribution on LibriSpeech dev-clean with the 7B model.
  • Figure 4: The normalized histogram of the Mistral score distribution on LibriSpeech dev-clean with the 1.3B model.
  • Figure 5: The log-normalized histogram of the auto-BLEU score distribution on LibriSpeech dev-clean with the 1.3B TWIST model.
  • ...and 1 more figure
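The data flow described in the Figure 1 caption (tokenize, sample, vocode, transcribe, rate) can be sketched as a small Python function. Every callable passed in is a hypothetical stand-in for a real component, and `score_continuations` and `num_samples` are illustrative names rather than anything defined by the paper. As a simplification, the evaluator here sees only the continuation transcript; a real evaluator may also be given the prompt.

```python
from typing import Callable, List, Sequence, Tuple

# All names below are hypothetical placeholders, not the paper's API.
Tokens = List[int]


def score_continuations(
    prompt_wave: Sequence[float],
    tokenize: Callable[[Sequence[float]], Tokens],  # speech tokenizer
    sample: Callable[[Tokens], Tokens],             # one sampled continuation from the SLM
    vocode: Callable[[Tokens], Sequence[float]],    # unit-based vocoder
    transcribe: Callable[[Sequence[float]], str],   # ASR model
    rate: Callable[[str], float],                   # LLM evaluator
    num_samples: int = 4,                           # assumed value
) -> List[Tuple[Tokens, str, float]]:
    """Return (speech tokens, transcript, score) for each sampled continuation."""
    prompt_tokens = tokenize(prompt_wave)
    results = []
    for _ in range(num_samples):
        continuation = sample(prompt_tokens)
        # Speech tokens -> waveform -> text, so a text LLM can judge the semantics.
        transcript = transcribe(vocode(continuation))
        results.append((continuation, transcript, rate(transcript)))
    return results
```

Passing the components in as callables keeps the sketch independent of any particular tokenizer, vocoder, ASR system, or evaluator, and makes it easy to exercise with simple stubs.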