Ming-UniAudio: Speech LLM for Joint Understanding, Generation and Editing with Unified Representation

Canxiang Yan, Chunxiang Jin, Dawei Huang, Haibing Yu, Han Peng, Hui Zhan, Jie Gao, Jing Peng, Jingdong Chen, Jun Zhou, Kaimeng Ren, Ming Yang, Mingxue Yang, Qiang Xu, Qin Zhao, Ruijie Xiong, Shaoxiong Lin, Xuezhi Wang, Yi Yuan, Yifei Wu, Yongjie Lyu, Zhengyu He, Zhihao Qiu, Zhiqiang Fang, Ziyuan Huang

TL;DR

Ming-UniAudio introduces a unified continuous speech tokenizer, MingTok-Audio, and a decoder-based speech LLM backbone that together enable understanding, generation, and free-form editing from a single representation. The approach relies on three-stage training (acoustic reconstruction, semantic distillation, and unified tokenizer training with an LLM), a diffusion-based per-token generation head, and a dedicated editing model (Ming-UniAudio-Edit) built atop the unified tokenizer. Empirical results show state-of-the-art performance on 8 of 12 ContextASR metrics, a competitive Seed-TTS-WER of 0.95 for Chinese voice cloning, and robust free-form speech editing evaluated with Ming-Freeform-Audio-Edit across semantic and acoustic tasks. The work is open-sourced, delivering the tokenizer, the unified model, and the free-form editing benchmark to foster unified audio understanding, generation, and manipulation research.

Abstract

Existing speech models suffer from the competing requirements that understanding and generation tasks place on token representations. This discrepancy in representation prevents speech language models from performing instruction-based free-form editing. To solve this challenge, we introduce a novel framework that unifies speech understanding, generation, and editing. The core of our unified model is a unified continuous speech tokenizer, MingTok-Audio, the first continuous tokenizer to effectively integrate semantic and acoustic features, which makes it suitable for both understanding and generation tasks. Based on this unified continuous audio tokenizer, we developed the speech language model Ming-UniAudio, which achieves a balance between generation and understanding capabilities. Ming-UniAudio sets new state-of-the-art (SOTA) records on 8 out of 12 metrics on the ContextASR benchmark. Notably, for Chinese voice cloning, it achieves a highly competitive Seed-TTS-WER of 0.95. Leveraging this foundational model, we further trained a dedicated speech editing model, Ming-UniAudio-Edit, the first speech language model that enables universal, free-form speech editing guided solely by natural language instructions, handling both semantic and acoustic modifications without timestamp conditions. To rigorously assess the editing capability and establish a foundation for future research, we introduce Ming-Freeform-Audio-Edit, the first comprehensive benchmark tailored for instruction-based free-form speech editing, featuring diverse scenarios and evaluation dimensions spanning semantic correctness, acoustic quality, and instruction alignment. We open-sourced the continuous audio tokenizer, the unified foundational model, and the free-form instruction-based editing model to facilitate the development of unified audio understanding, generation, and manipulation.

Paper Structure

This paper contains 64 sections, 6 equations, 8 figures, 16 tables.

Figures (8)

  • Figure 1: Comparison of Ming-UniAudio with Open-Source Speech LLMs across Understanding, Generation, and Editing Tasks.
  • Figure 2: The overall framework of MingTok-Audio.
  • Figure 3: Model Architecture of Ming-UniAudio.
  • Figure 4: Overview of Audio Data Processing Pipeline.
  • Figure 5: The loss curve of the ablation study for semantic module freezing and diffusion head initialization.
  • ...and 3 more figures