OmniFlatten: An End-to-end GPT Model for Seamless Voice Conversation

Qinglin Zhang, Luyao Cheng, Chong Deng, Qian Chen, Wen Wang, Siqi Zheng, Jiaqing Liu, Hai Yu, Chaohong Tan, Zhihao Du, Shiliang Zhang

TL;DR

OmniFlatten presents an end-to-end GPT-based approach for seamless full-duplex voice conversation by transforming a text LLM into a speech-text multimodal model through a multi-stage post-training pipeline and a unified flattening representation. The method first performs modality alignment with ASR/TTS supervision, then trains on half-duplex and full-duplex dialogue in progressively more streaming-oriented formats, using chunked, interleaved data. Empirical evaluations on ASR/TTS quality, full-duplex capability, and turn-taking efficiency show competitive performance and superior latency relative to baselines, validating the practicality of end-to-end full-duplex dialogue without backbone modification. The work offers a scalable, data-driven pathway toward natural, real-time voice conversations and points to future enhancements in data realism, backchannels, and multimodal extensions such as vision.

Abstract

Full-duplex spoken dialogue systems significantly surpass traditional turn-based dialogue systems, as they allow simultaneous bidirectional communication, closely mirroring human-human interactions. However, achieving low latency and natural interactions in full-duplex dialogue systems remains a significant challenge, especially considering human conversation dynamics such as interruptions, backchannels, and overlapping speech. In this paper, we introduce OmniFlatten, a novel end-to-end GPT-based model for full-duplex conversation, capable of effectively modeling the complex behaviors inherent to natural conversations with low latency. To achieve full-duplex conversation capabilities, we propose a multi-stage post-training scheme that progressively adapts a text large language model (LLM) backbone into a speech-text dialogue LLM, capable of generating text and speech in real time, without modifying the architecture of the backbone LLM. The training process comprises three stages: modality alignment, half-duplex dialogue learning, and full-duplex dialogue learning. In all training stages, we standardize the data using a flattening operation, which unifies the training methods and the GPT backbone across different modalities and tasks. Our approach offers a simple modeling technique and a promising research direction for developing efficient and natural end-to-end full-duplex spoken dialogue systems. Audio samples of dialogues generated by OmniFlatten can be found at https://omniflatten.github.io/.

Paper Structure

This paper contains 20 sections, 5 figures, 4 tables.

Figures (5)

  • Figure 1: The overall architecture of our E2E full-duplex spoken dialogue model OmniFlatten.
  • Figure 2: Half-duplex Dialogue Training based on all four streams of speech and text tokens of User and Assistant, organized according to the actual speaker turns. We flatten the speech and text tokens into a single sequence, as follows: User Speech Tokens (red squares) and User Text Tokens (red circles) in Turn N-1, Assistant Text Tokens (blue circles) and Assistant Speech Tokens (blue squares) in Turn N.
  • Figure 3: Full-duplex Dialogue Training based on three streams of full-duplex dialogue data. User input and Assistant output speech and text token sequences are segmented into short chunks and flattened. At Chunk N-1, five User speech tokens (red squares) are input, and the model outputs two Assistant text tokens (blue circles) and five Assistant speech tokens (blue squares). The dashed arrows denote that, within a chunk, the model appends the predicted Assistant text and speech tokens to the input to complete autoregressive decoding.
  • Figure 4: Full-duplex Dialogue Training based on two streams of full-duplex dialogue data (further removing the Assistant text stream). In Chunk N-1, five User speech tokens are input, and the model outputs five Assistant speech tokens.
  • Figure 6: The simulation process of dialogue learning data.
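
The flattening layouts described in the captions of Figures 2 to 4 can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names, the dictionary keys, and the string tokens are hypothetical placeholders, and the chunk sizes (five speech tokens, two text tokens) are taken only from the Figure 3 caption. How the paper handles streams of unequal length (for example, padding with silence tokens) is not stated here, so the sketch simply skips exhausted streams.

```python
from typing import Dict, List, Sequence

Token = str  # placeholder; a real system would use integer token IDs


def chunk(tokens: Sequence[Token], size: int) -> List[List[Token]]:
    """Split a token stream into consecutive chunks of at most `size` tokens."""
    return [list(tokens[i:i + size]) for i in range(0, len(tokens), size)]


def flatten_half_duplex(turns: Sequence[Dict[str, Sequence[Token]]]) -> List[Token]:
    """Four-stream layout (Figure 2): per exchange, user speech, user text,
    assistant text, assistant speech, concatenated in speaker-turn order.
    The dictionary keys are hypothetical names chosen for this sketch."""
    order = ("user_speech", "user_text", "assistant_text", "assistant_speech")
    flat: List[Token] = []
    for turn in turns:
        for key in order:
            flat.extend(turn.get(key, []))
    return flat


def flatten_full_duplex(
    user_speech: Sequence[Token],
    assistant_text: Sequence[Token],
    assistant_speech: Sequence[Token],
    speech_chunk: int = 5,
    text_chunk: int = 2,
    use_text: bool = True,
) -> List[Token]:
    """Chunked layout: three streams (Figure 3) when use_text=True,
    two streams (Figure 4) when use_text=False."""
    u = chunk(user_speech, speech_chunk)
    t = chunk(assistant_text, text_chunk) if use_text else []
    a = chunk(assistant_speech, speech_chunk)
    flat: List[Token] = []
    for i in range(max(len(u), len(t), len(a))):
        # Within chunk i: user speech in, then assistant text, then assistant speech.
        # Assumption: an exhausted stream contributes nothing to later chunks.
        for stream in (u, t, a):
            if i < len(stream):
                flat.extend(stream[i])
    return flat


if __name__ == "__main__":
    us = [f"us{i}" for i in range(10)]
    at = [f"at{i}" for i in range(4)]
    asp = [f"as{i}" for i in range(10)]
    print(flatten_full_duplex(us, at, asp))                  # three-stream layout
    print(flatten_full_duplex(us, at, asp, use_text=False))  # two-stream layout
```

In this sketch each three-stream chunk contributes twelve tokens (five user speech, two assistant text, five assistant speech) to one flat sequence, which is what lets a single autoregressive GPT backbone consume it without architectural changes.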