UniVoice: Unifying Autoregressive ASR and Flow-Matching based TTS with Large Language Models

Wenhao Guan, Zhikang Niu, Ziyue Jiang, Kaidi Wang, Peijie Chen, Qingyang Hong, Lin Li, Xie Chen

TL;DR

UniVoice unifies automatic speech recognition and text-to-speech in a single model that operates on continuous speech representations. It fuses autoregressive (AR) ASR with flow-matching (FM) based TTS inside a dual-branch transformer, using a dual attention mask to reconcile the causal requirement of recognition with the non-autoregressive requirement of synthesis, and a text-prefix infilling strategy for high-fidelity zero-shot voice cloning. The training objective combines $L_{LM}$ for AR ASR and $L_{audio}^{cfm}$ for FM-based TTS, weighted by $\lambda$, with classifier-free guidance to improve robustness. On LibriHeavy, UniVoice achieves competitive ASR performance and near state-of-the-art zero-shot TTS quality among unified models while remaining parameter-efficient, which indicates strong potential for end-to-end unified speech understanding and generation.
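
As a reading aid, the combined objective can be written out explicitly. This is a sketch under stated assumptions, not the paper's exact formulation: the summary does not say which term $\lambda$ scales, so placing it on the flow-matching term is an assumption, and the conditional flow-matching loss is shown in its common linear-interpolation form. The symbols $x_0$ (Gaussian noise), $x_1$ (target speech features), $c$ (conditioning), $t$ (flow time), and $v_\theta$ (predicted velocity field) are introduced here for illustration only.

$$
\mathcal{L} = L_{LM} + \lambda \, L_{audio}^{cfm}, \qquad
L_{audio}^{cfm} = \mathbb{E}_{t,\, x_0,\, x_1} \left\| v_\theta(x_t, t, c) - (x_1 - x_0) \right\|^2, \qquad
x_t = (1 - t)\, x_0 + t\, x_1
$$

Under this reading, $L_{LM}$ is the usual next-token cross-entropy over the transcript, and $\lambda$ trades recognition accuracy against synthesis quality.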

Abstract

Large language models (LLMs) have demonstrated promising performance in both automatic speech recognition (ASR) and text-to-speech (TTS) systems, gradually becoming the mainstream approach. However, most current approaches address these tasks separately rather than through a unified framework. This work aims to integrate these two tasks into one unified model. Although discrete speech tokenization enables joint modeling, its inherent information loss limits performance in both recognition and generation. In this work, we present UniVoice, a unified LLM framework built on continuous representations that seamlessly integrates speech recognition and synthesis within a single model. Our approach combines the strengths of autoregressive modeling for speech recognition with flow matching for high-quality generation. To mitigate the inherent divergence between autoregressive and flow-matching models, we further design a dual attention mechanism, which switches between a causal mask for recognition and a bidirectional attention mask for synthesis. Furthermore, the proposed text-prefix-conditioned speech infilling method enables high-fidelity zero-shot voice cloning. Experimental results demonstrate that our method matches or exceeds current single-task modeling methods in both ASR and zero-shot TTS tasks. This work explores new possibilities for end-to-end speech understanding and generation. Code is available at https://github.com/gwh22/UniVoice.
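
The switch between a causal mask for recognition and a bidirectional mask for synthesis can be illustrated with a minimal sketch. It is not taken from the UniVoice code base: the function names `build_attention_mask` and `masked_softmax`, the task labels `"asr"` and `"tts"`, and the use of plain Python lists are all assumptions made for illustration.

```python
import math
from typing import List

Mask = List[List[bool]]


def build_attention_mask(seq_len: int, task: str) -> Mask:
    """Return a seq_len x seq_len mask; True means query i may attend to key j.

    Hypothetical helper: "asr" gives a causal (lower-triangular) mask,
    "tts" gives a fully bidirectional mask.
    """
    if task == "asr":
        # Autoregressive decoding: position i sees only positions j <= i.
        return [[j <= i for j in range(seq_len)] for i in range(seq_len)]
    if task == "tts":
        # Flow-matching synthesis: every position sees the whole sequence.
        return [[True] * seq_len for _ in range(seq_len)]
    raise ValueError(f"unknown task: {task!r}")


def masked_softmax(scores: List[List[float]], mask: Mask) -> List[List[float]]:
    """Row-wise softmax that assigns zero weight to masked-out positions."""
    weights = []
    for row_scores, row_mask in zip(scores, mask):
        allowed = [s for s, m in zip(row_scores, row_mask) if m]
        peak = max(allowed)  # subtract the row maximum for numerical stability
        exps = [math.exp(s - peak) if m else 0.0
                for s, m in zip(row_scores, row_mask)]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


if __name__ == "__main__":
    uniform_scores = [[0.0] * 3 for _ in range(3)]
    for task in ("asr", "tts"):
        mask = build_attention_mask(3, task)
        print(task, masked_softmax(uniform_scores, mask))
```

With uniform scores, the causal mask spreads each row's attention over the current and earlier positions only, while the bidirectional mask spreads it evenly over the whole sequence. A practical implementation would apply the same idea to batched tensors inside the attention layers of the shared transformer.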

Paper Structure

This paper contains 39 sections, 8 equations, 2 figures, 4 tables.

Figures (2)

  • Figure 1: An overview of the UniVoice model. Blue elements (blocks and lines) denote ASR components, while green elements represent TTS components. The gradient-colored modules (blue-to-green or green-to-blue) indicate components shared by the ASR and TTS systems.
  • Figure 2: Two variants of the TTS model design in UniVoice. (a) UniVoice-TTS-speaker. (b) UniVoice-TTS-infilling.