OmniFlatten: An End-to-end GPT Model for Seamless Voice Conversation
Qinglin Zhang, Luyao Cheng, Chong Deng, Qian Chen, Wen Wang, Siqi Zheng, Jiaqing Liu, Hai Yu, Chaohong Tan, Zhihao Du, Shiliang Zhang
TL;DR
OmniFlatten is an end-to-end GPT-based model for seamless full-duplex voice conversation. It turns a text LLM into a speech-text multimodal model through a multi-stage post-training pipeline and a unified flattening representation, without modifying the backbone. Training starts with modality alignment under ASR and TTS supervision, then moves to half-duplex and finally full-duplex dialogue training on chunked, interleaved data, with each stage closer to streaming use. Evaluations of ASR/TTS quality, full-duplex capability, and turn-taking efficiency show competitive performance and lower latency than the baselines, supporting the practicality of end-to-end full-duplex dialogue. The work offers a scalable, data-driven path toward natural, real-time voice conversation and points to future improvements in data realism, backchannel handling, and extension to additional modalities such as vision.
Abstract
Full-duplex spoken dialogue systems significantly surpass traditional turn-based dialogue systems, as they allow simultaneous bidirectional communication, closely mirroring human-human interactions. However, achieving low latency and natural interactions in full-duplex dialogue systems remains a significant challenge, especially considering human conversation dynamics such as interruptions, backchannels, and overlapping speech. In this paper, we introduce OmniFlatten, a novel end-to-end GPT-based model for full-duplex conversation that effectively models the complex behaviors inherent to natural conversations with low latency. To achieve full-duplex conversation capabilities, we propose a multi-stage post-training scheme that progressively adapts a text large language model (LLM) backbone into a speech-text dialogue LLM capable of generating text and speech in real time, without modifying the architecture of the backbone LLM. The training process comprises three stages: modality alignment, half-duplex dialogue learning, and full-duplex dialogue learning. In all training stages, we standardize the data using a flattening operation, which allows us to unify the training methods and the GPT backbone across different modalities and tasks. Our approach offers a simple modeling technique and a promising research direction for developing efficient and natural end-to-end full-duplex spoken dialogue systems. Audio samples of dialogues generated by OmniFlatten can be found at https://omniflatten.github.io/.
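As a rough illustration of what flattening chunked, interleaved data into a single sequence can look like, the Python sketch below alternates fixed-size chunks of text tokens and speech tokens. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the chunk sizes, and the string tokens are all hypothetical and do not come from the text above.

```python
from typing import List, Sequence


def chunk(tokens: Sequence[str], size: int) -> List[List[str]]:
    """Split a token sequence into consecutive chunks of at most `size` tokens."""
    if size <= 0:
        raise ValueError("size must be positive")
    return [list(tokens[i:i + size]) for i in range(0, len(tokens), size)]


def flatten(
    text_tokens: Sequence[str],
    speech_tokens: Sequence[str],
    text_chunk: int = 2,    # hypothetical chunk size, chosen for illustration
    speech_chunk: int = 4,  # hypothetical chunk size, chosen for illustration
) -> List[str]:
    """Interleave text and speech chunks into one flat token sequence.

    Each round emits one text chunk followed by one speech chunk. When one
    stream runs out, the remaining chunks of the other stream are appended
    in order.
    """
    text_chunks = chunk(text_tokens, text_chunk)
    speech_chunks = chunk(speech_tokens, speech_chunk)
    flat: List[str] = []
    for i in range(max(len(text_chunks), len(speech_chunks))):
        if i < len(text_chunks):
            flat.extend(text_chunks[i])
        if i < len(speech_chunks):
            flat.extend(speech_chunks[i])
    return flat


if __name__ == "__main__":
    text = ["t1", "t2", "t3"]
    speech = ["s1", "s2", "s3", "s4", "s5"]
    print(flatten(text, speech))
```

Because the result is an ordinary one-dimensional token sequence, a decoder-only model can be trained on it with the usual next-token objective, which is consistent with the abstract's point that the backbone architecture is left unchanged.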
