Align-SLM: Textless Spoken Language Models with Reinforcement Learning from AI Feedback

Guan-Ting Lin; Prashanth Gurunath Shivakumar; Aditya Gourav; Yile Gu; Ankur Gandhe; Hung-yi Lee; Ivan Bulyko

TL;DR

This paper addresses the semantic gap between textless Spoken Language Models (SLMs) and text-based LLMs by introducing Align-SLM, a framework that applies preference optimization inspired by Reinforcement Learning from AI Feedback (RLAIF) to improve long-range semantics in speech. It generates multiple continuations from a prompt, creates automatic preference data using an LLM-based semantic evaluator (Mistral) and perplexity- or BLEU-based criteria, and trains with Direct Preference Optimization (DPO) coupled with curriculum learning. Across ZeroSpeech 2021 benchmarks and spoken StoryCloze tasks, Align-SLM achieves state-of-the-art results for textless SLMs, with gains in semantic coherence and continuation relevance, while maintaining audio quality. The approach demonstrates the importance of preference-based semantic alignment for end-to-end speech-to-speech models and points to scalable extensions, including multilingual support and broader paralinguistic aspects.

Abstract

While textless Spoken Language Models (SLMs) have shown potential in end-to-end speech-to-speech modeling, they still lag behind text-based Large Language Models (LLMs) in terms of semantic coherence and relevance. This work introduces the Align-SLM framework, which leverages preference optimization inspired by Reinforcement Learning from AI Feedback (RLAIF) to enhance the semantic understanding of SLMs. Our approach generates multiple speech continuations from a given prompt and uses semantic metrics to create preference data for Direct Preference Optimization (DPO). We evaluate the framework using ZeroSpeech 2021 benchmarks for lexical and syntactic modeling, the spoken version of the StoryCloze dataset for semantic coherence, and other speech generation metrics, including the GPT-4o score and human evaluation. Experimental results show that our method achieves state-of-the-art performance for SLMs on most benchmarks, highlighting the importance of preference optimization to improve the semantics of SLMs.
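
The two steps named in the abstract, building preference data from scored continuations and optimizing with DPO, can be illustrated with a minimal sketch. The pair-selection rule here (highest-scored continuation as chosen, lowest-scored as rejected, skipped when the gap is below a hypothetical `min_gap`) is an assumption for illustration and is not taken from the paper; the function names and the `beta` default are likewise illustrative. The loss is the standard per-pair DPO objective, written for scalar sequence log-probabilities.

```python
import math
from typing import Optional, Sequence, Tuple


def select_preference_pair(
    scores: Sequence[float], min_gap: float = 1.0
) -> Optional[Tuple[int, int]]:
    """Pick (chosen_idx, rejected_idx) from semantic scores of N continuations.

    Assumed rule (not from the paper): best score is chosen, worst is
    rejected, and the prompt is skipped when the two are closer than min_gap.
    """
    if len(scores) < 2:
        return None
    chosen = max(range(len(scores)), key=lambda i: scores[i])
    rejected = min(range(len(scores)), key=lambda i: scores[i])
    if scores[chosen] - scores[rejected] < min_gap:
        return None
    return chosen, rejected


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,
) -> float:
    """Standard DPO loss for one pair: -log(sigmoid(beta * margin))."""
    margin = beta * (
        (policy_chosen_logp - ref_chosen_logp)
        - (policy_rejected_logp - ref_rejected_logp)
    )
    # Numerically stable form of log(1 + exp(-margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

In this sketch, a prompt whose continuations all receive similar scores yields no pair, so only clearly separated continuations would contribute to training. The loss falls below log 2 once the policy favors the chosen continuation more strongly than the reference model does.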
