FLM-Audio: Natural Monologues Improves Native Full-Duplex Chatbots via Dual Training
Yiqun Yao, Xiang Li, Xin Jiang, Xuezhi Fang, Naitong Yu, Wenjia Ma, Aixin Sun, Yequan Wang
TL;DR
FLM-Audio tackles latency and alignment in native full-duplex dialog by using contiguous monologues and a dual training paradigm. It leverages a 7B bilingual RQ-Transformer backbone to generate text and audio in parallel with sentence-level alignment, reducing annotation costs and data requirements. Through a four-stage training pipeline and ASR/TTS-style supervision, FLM-Audio achieves performance competitive with or superior to state-of-the-art baselines while using far less data, delivering improved naturalness, responsiveness, and robustness in full-duplex dialog. These findings suggest that native full-duplex architectures, when combined with contiguous monologues, offer practical benefits for responsive, multilingual spoken dialog systems and merit scaling to larger models.
Abstract
Full-duplex dialog models aim to listen and speak simultaneously, delivering rapid responses to dynamic user input. Among different solutions to full-duplexity, a native solution merges multiple channels in each time step, achieving the lowest latency. However, prevailing designs break the textual monologue sentences apart for word-level alignment with audio streams, which degrades language modeling abilities. To address this issue, we introduce "contiguous monologues", which are composed of continuous sentences and "waiting" intervals, mimicking human-like cognitive behavior in dialogs. We find a proper training paradigm to be critical for semantically aligning contiguous monologues with audio. To this end, we develop a "dual" training paradigm that alternates the position of the monologues, either leading or trailing the audio, across different training stages. The combination of our contiguous monologues and dual training strategy is applied in developing FLM-Audio, our 7B spoken dialog chatbot with native full-duplexity. As confirmed by experimental results, FLM-Audio achieves superior response quality and chatting experience while requiring significantly less training data.
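The contrast between word-level alignment and contiguous monologues can be illustrated with a minimal sketch. It is not the FLM-Audio implementation: the `<wait>` token, the frame indices, and both function names are hypothetical, chosen only to show how a text channel might be laid out against audio frames under each scheme.

```python
from typing import List, Tuple

WAIT = "<wait>"  # hypothetical filler token for frames that carry no text


def word_level_channel(words: List[Tuple[str, int]], n_frames: int) -> List[str]:
    """Word-level alignment: each word sits at its own audio onset frame,
    so the sentence is broken up by filler tokens."""
    channel = [WAIT] * n_frames
    for word, onset in words:
        if not 0 <= onset < n_frames:
            raise ValueError(f"onset {onset} is outside the frame range")
        channel[onset] = word
    return channel


def contiguous_channel(
    sentences: List[Tuple[List[str], int]], n_frames: int
) -> List[str]:
    """Contiguous monologue: each sentence is written as one unbroken run of
    tokens from its start frame, followed by a waiting interval.
    Assumes the sentence runs do not overlap."""
    channel = [WAIT] * n_frames
    for tokens, start in sentences:
        if start < 0 or start + len(tokens) > n_frames:
            raise ValueError("sentence does not fit in the frame range")
        for offset, token in enumerate(tokens):
            channel[start + offset] = token
    return channel


# Toy example: three words spoken over eight audio frames.
word_aligned = word_level_channel(
    [("hello", 0), ("there", 3), ("friend", 6)], n_frames=8
)
contiguous = contiguous_channel(
    [(["hello", "there", "friend"], 0)], n_frames=8
)
```

In the word-level layout the three words are separated by `<wait>` tokens, whereas the contiguous layout keeps the sentence intact and gathers the waiting interval at the end, so only sentence-level start positions are needed rather than per-word timestamps.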
