FLM-Audio: Natural Monologues Improves Native Full-Duplex Chatbots via Dual Training

Yiqun Yao, Xiang Li, Xin Jiang, Xuezhi Fang, Naitong Yu, Wenjia Ma, Aixin Sun, Yequan Wang

TL;DR

FLM-Audio tackles latency and alignment in native full-duplex dialog by using contiguous monologues and a dual training paradigm. It leverages a 7B bilingual RQ-Transformer backbone to generate text and audio in parallel with sentence-level alignment, reducing annotation costs and data requirements. Through a four-stage training pipeline and ASR/TTS-style supervision, FLM-Audio achieves performance competitive with or superior to state-of-the-art baselines while using far less data, delivering improved naturalness, responsiveness, and robustness in full-duplex dialogue. These findings suggest that native full-duplex architectures, when combined with contiguous monologues, offer practical benefits for responsive, multilingual spoken dialog systems and merit scaling to larger models.

Abstract

Full-duplex dialog models aim to listen and speak simultaneously, delivering rapid responses to dynamic user input. Among different solutions to full-duplexity, a native solution merges multiple channels in each time step, achieving the lowest latency. However, prevailing designs break down the textual monologue sentences for word-level alignment with audio streams, which degrades language modeling abilities. To help address this issue, we introduce "contiguous monologues", which are composed of continuous sentences and "waiting" intervals, mimicking human-like cognitive behavior in dialogs. We find a proper training paradigm to be critical for semantically aligning contiguous monologues with audio. To this end, we develop a "dual" training paradigm that alternates the position of the monologues, either leading or trailing the audio, across different training stages. A combination of our contiguous monologue and dual training strategy is applied in developing FLM-Audio, our 7B spoken dialog chatbot with native full-duplexity. As confirmed by experimental results, FLM-Audio achieves superior response quality and chatting experience while requiring significantly less training data.
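
The contrast between word-level alignment and contiguous monologues can be illustrated with a toy layout of the text channel over a fixed number of audio frames. This is a minimal sketch, not the authors' implementation: the `<wait>` token name, the function names, the `lead` offset, and the example onsets are all assumptions made for illustration.

```python
from typing import List

# Hypothetical filler token; the actual token used by FLM-Audio is not stated here.
WAIT = "<wait>"


def contiguous_text_channel(tokens: List[str], n_frames: int, lead: int = 0) -> List[str]:
    """Place a sentence's tokens back to back, then fill the rest with WAIT.

    `lead` is a toy offset (in frames) for where the sentence starts
    inside the audio span; 0 means the text starts with the audio.
    """
    if lead < 0 or lead + len(tokens) > n_frames:
        raise ValueError("sentence does not fit in the audio span")
    channel = [WAIT] * n_frames
    # Same-length slice assignment keeps the channel at n_frames entries.
    channel[lead:lead + len(tokens)] = tokens
    return channel


def word_aligned_text_channel(tokens: List[str], onsets: List[int], n_frames: int) -> List[str]:
    """Word-level layout: each token sits at its own onset frame.

    This layout needs a per-word timestamp (`onsets`), and the sentence
    ends up interleaved with filler tokens.
    """
    if len(tokens) != len(onsets):
        raise ValueError("need exactly one onset per token")
    channel = [WAIT] * n_frames
    for tok, frame in zip(tokens, onsets):
        channel[frame] = tok
    return channel


if __name__ == "__main__":
    sentence = ["nice", "to", "meet", "you"]
    print(contiguous_text_channel(sentence, n_frames=8))
    print(word_aligned_text_channel(sentence, onsets=[0, 2, 5, 6], n_frames=8))
```

In the contiguous layout the sentence stays intact as a normal token sequence and only the sentence boundary has to be placed against the audio, whereas the word-level layout splits the sentence around filler tokens and depends on word timestamps.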

Paper Structure

This paper contains 23 sections, 2 equations, 3 figures, and 8 tables.

Figures (3)

  • Figure 1: TDM vs. Native Full Duplexity for human-like responsiveness.
  • Figure 2: Stream organization for text and audio in FLM-Audio.
  • Figure 3: Training data token organization in different stages.