Baichuan-Audio: A Unified Framework for End-to-End Speech Interaction

Tianpeng Li, Jun Liu, Tao Zhang, Yuanbo Fang, Da Pan, Mingrui Wang, Zheng Liang, Zehuan Li, Mingan Lin, Guosheng Dong, Jianhua Xu, Haoze Sun, Zenan Zhou, Weipeng Chen

TL;DR

Baichuan-Audio presents an end-to-end audio LLM that unifies speech understanding and generation through a 12.5 Hz multi-codebook audio tokenizer and a flow-matching decoder, coupled with an independent audio head. A two-stage pre-training strategy preserves language understanding while enhancing audio modeling, enabling real-time speech interaction and robust QA. The model demonstrates strong performance across ASR, TTS, and audio understanding benchmarks, surpassing several open-source baselines and approaching or exceeding text-only upper bounds in some settings. By open-sourcing data, model, and training pipelines, it advances practical real-time voice interaction research and applications.

Abstract

We introduce Baichuan-Audio, an end-to-end audio large language model that integrates audio understanding and generation. It features a text-guided aligned speech generation mechanism, enabling real-time speech interaction with both comprehension and generation capabilities. Baichuan-Audio uses a pre-trained ASR model followed by multi-codebook discretization of speech at a frame rate of 12.5 Hz, so that speech tokens retain both semantic and acoustic information. An independent audio head processes the audio tokens to capture their distinct characteristics. To mitigate the loss of intelligence during pre-training and preserve the original capabilities of the LLM, we propose a two-stage pre-training strategy that maintains language understanding while enhancing audio modeling. Following alignment, the model performs well in real-time spoken dialogue and exhibits strong question-answering abilities. Our code, model, and training data are available at https://github.com/baichuan-inc/Baichuan-Audio
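
To make the 12.5 Hz frame rate concrete, the following is a minimal sketch of the token arithmetic it implies: each frame covers 80 ms of audio, and every frame yields one token per codebook. Only the frame rate comes from the abstract. The function name `audio_token_budget`, the choice to drop a partial trailing frame, and the value `num_codebooks=8` in the usage line are illustrative assumptions, not details confirmed here.

```python
from dataclasses import dataclass

FRAME_RATE_HZ = 12.5  # frame rate stated in the abstract


@dataclass
class TokenBudget:
    frames: int  # number of tokenizer frames in the clip
    tokens: int  # total discrete tokens across all codebooks


def audio_token_budget(duration_s: float, num_codebooks: int) -> TokenBudget:
    """Count frames and discrete tokens for a clip of the given duration.

    Assumption: a partial trailing frame is dropped (floor), which may
    differ from how the actual tokenizer pads or truncates audio.
    """
    if duration_s < 0 or num_codebooks < 1:
        raise ValueError("duration must be >= 0 and num_codebooks >= 1")
    frames = int(duration_s * FRAME_RATE_HZ)
    return TokenBudget(frames=frames, tokens=frames * num_codebooks)


# Hypothetical usage: 8 codebooks is a placeholder value, not a sourced figure.
budget = audio_token_budget(duration_s=10.0, num_codebooks=8)
print(budget.frames, budget.tokens)
```

At this frame rate the sequence length grows slowly with duration, which is part of what makes a low-frame-rate tokenizer attractive for real-time interaction; the per-frame cost then scales with the number of codebooks.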

Paper Structure

This paper contains 17 sections, 5 equations, 4 figures, 9 tables.

Figures (4)

  • Figure 1: The overview of Baichuan-Audio. Our model is an end-to-end large audio language model. When generating audio, the audio LLM alternately predicts text tokens and audio tokens. The audio tokens are then decoded by the flow-matching based audio decoder to produce the final audio.
  • Figure 2: Baichuan-Audio-Tokenizer.
  • Figure 3: Flow-matching based audio decoder.
  • Figure 4: Pipeline of interleaved data collection.
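
The Figure 1 caption describes output in which text tokens and audio tokens alternate, with only the audio tokens passed on to the audio decoder. As a rough illustration of that routing step, the sketch below separates an interleaved stream by modality. The `(modality, token_id)` tuple representation and the function name `split_interleaved` are hypothetical and chosen for illustration; the actual model's token format and chunking scheme are not specified in this section.

```python
from typing import Iterable, List, Tuple

# Hypothetical representation: each generated token is tagged with its modality.
Token = Tuple[str, int]  # ("text" | "audio", token_id)


def split_interleaved(stream: Iterable[Token]) -> Tuple[List[int], List[int]]:
    """Route an interleaved token stream into text ids and audio ids.

    The audio ids are what a downstream audio decoder would consume;
    the text ids form the textual transcript of the response.
    """
    text_ids: List[int] = []
    audio_ids: List[int] = []
    for modality, token_id in stream:
        if modality == "text":
            text_ids.append(token_id)
        elif modality == "audio":
            audio_ids.append(token_id)
        else:
            raise ValueError(f"unknown modality: {modality!r}")
    return text_ids, audio_ids


# Toy stream with made-up ids: two text tokens, three audio tokens, one text token.
stream = [("text", 11), ("text", 12), ("audio", 501), ("audio", 502),
          ("audio", 503), ("text", 13)]
text_ids, audio_ids = split_interleaved(stream)
print(text_ids, audio_ids)
```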