DrVoice: Parallel Speech-Text Voice Conversation Model via Dual-Resolution Speech Representations

Chao-Hong Tan, Qian Chen, Wen Wang, Chong Deng, Qinglin Zhang, Luyao Cheng, Hai Yu, Xin Zhang, Xiang Lv, Tianyu Zhao, Chong Zhang, Yukun Ma, Yafeng Chen, Hui Wang, Jiaqing Liu, Xiangang Li, Jieping Ye

TL;DR

DrVoice targets efficient end-to-end joint speech-text generation. Its Dual-Resolution Speech Representations group speech tokens with a grouping factor $k=5$, lowering the LLM input rate to 5Hz. The model pairs a Shared LLM Layer with a Text Head and a Speech Refined Head (SRH) to generate text and ungrouped speech tokens in parallel, guided by Chain-of-Modality (CoM) reasoning and a two-stage Core-Cocktail training regime that preserves the base LLM's knowledge. The approach achieves state-of-the-art results on OpenAudioBench, VoiceBench, UltraEval-Audio, and Big Bench Audio among models of roughly 7B parameters, while reducing training GPU hours by about 50%. Key findings show that the Continuous Speech Encoder (CSE), SRH, and the data curriculum (CoM-Mixing) are essential for balancing speech understanding and generation with text capabilities, and that data quality significantly influences real-world performance. Overall, DrVoice advances open-source speech foundation modeling with strong cross-modal coherence, efficiency, and scalability for practical voice conversation systems.

Abstract

Recent studies on end-to-end (E2E) speech generation with large language models (LLMs) have attracted significant community attention, with multiple works extending text-based LLMs to generate discrete speech tokens. Existing E2E approaches primarily fall into two categories: (1) methods that generate discrete speech tokens independently without incorporating them into the LLM's autoregressive process, so that text generation is unaware of concurrent speech synthesis; and (2) models that generate interleaved or parallel speech-text tokens through joint autoregressive modeling, enabling mutual modality awareness during generation. This paper presents DrVoice, a parallel speech-text voice conversation model based on joint autoregressive modeling, featuring dual-resolution speech representations. Notably, while current methods mainly use a 12.5Hz input audio representation, our proposed dual-resolution mechanism reduces the input frequency for the LLM to 5Hz. This significantly reduces computational cost and alleviates the frequency discrepancy between speech and text tokens, which in turn better exploits the LLM's capabilities. Experimental results demonstrate that DrVoice-7B establishes a new state of the art (SOTA) on prominent speech benchmarks including OpenAudioBench, VoiceBench, UltraEval-Audio, and Big Bench Audio, making it a leading open-source speech foundation model among ~7B models.
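To make the rate reduction concrete, the following is a minimal sketch of token grouping, not the authors' implementation. It assumes that grouping simply chunks $k$ consecutive speech tokens into one LLM input position, so a 5Hz LLM rate with $k=5$ would correspond to a 25Hz tokenizer rate (an inference from the stated numbers). The function names `group_frames`, `ungroup`, and `llm_input_rate`, and the padding scheme, are hypothetical.

```python
from typing import List, Sequence


def group_frames(frames: Sequence[int], k: int, pad: int = 0) -> List[List[int]]:
    """Chunk a token sequence into groups of k, right-padding the last group.

    Assumption: one group becomes one LLM input position. How a group is
    turned into a single embedding is outside this sketch.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    groups = []
    for start in range(0, len(frames), k):
        chunk = list(frames[start:start + k])
        chunk += [pad] * (k - len(chunk))  # pad only the final, short group
        groups.append(chunk)
    return groups


def ungroup(groups: Sequence[Sequence[int]]) -> List[int]:
    """Flatten groups back into a token-rate sequence (padding included)."""
    return [token for group in groups for token in group]


def llm_input_rate(token_rate_hz: float, k: int) -> float:
    """LLM input rate after grouping k tokens per position."""
    return token_rate_hz / k


# One second of hypothetical 25Hz speech tokens, grouped with k=5.
tokens = list(range(25))
grouped = group_frames(tokens, k=5)
print(len(grouped))             # number of LLM positions for one second
print(llm_input_rate(25.0, 5))  # resulting LLM input rate in Hz
```

Under these assumptions, one second of speech occupies 5 LLM positions instead of 25, which is where the reduction in sequence length comes from.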

Paper Structure

This paper contains 16 sections, 9 equations, 3 figures, 9 tables.

Figures (3)

  • Figure 1: Overview of DrVoice. User speech inputs are tokenized, grouped, and encoded by the MLLM for autoregressive text and speech token prediction. The MLLM consists of a Shared LLM Layer, a Text Head, and a Speech Refined Head (SRH) for token generation. The generated speech tokens are then converted to a speech waveform by the speech detokenizer. Note that the SRH generates $k$ speech tokens through $k$ autoregressive forward passes, where $k$ is the grouping factor.
  • Figure 2: Computational resources for training on 17K hours of data across different grouping factors.
  • Figure 3: Performance scaling of DrVoice-Small (w/o Continuous Speech Encoder) on the LLaMA Question benchmark.
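The control flow described in the Figure 1 caption, where the SRH emits $k$ speech tokens through $k$ autoregressive passes per LLM step, can be sketched as follows. This is an illustrative toy under stated assumptions, not the paper's model: `toy_head` is a deterministic stand-in for a learned head, and `refine_group` and `decode_speech` are hypothetical names.

```python
from typing import Callable, List, Sequence

Hidden = Sequence[float]
# A head step maps (LLM hidden state, tokens emitted so far in this group)
# to the next speech token id.
HeadStep = Callable[[Hidden, List[int]], int]


def refine_group(hidden: Hidden, head_step: HeadStep, k: int) -> List[int]:
    """Emit k speech tokens for one LLM step via k autoregressive passes."""
    tokens: List[int] = []
    for _ in range(k):
        # Each pass conditions on the hidden state and the group prefix.
        tokens.append(head_step(hidden, tokens))
    return tokens


def decode_speech(hiddens: Sequence[Hidden], head_step: HeadStep, k: int) -> List[int]:
    """Expand a sequence of LLM hidden states into ungrouped speech tokens."""
    out: List[int] = []
    for hidden in hiddens:
        out.extend(refine_group(hidden, head_step, k))
    return out


def toy_head(hidden: Hidden, prefix: List[int]) -> int:
    """Hypothetical stand-in for a learned head; purely deterministic."""
    return (int(sum(hidden)) + len(prefix)) % 10


# Two LLM steps with k=5 yield 10 speech tokens.
speech_tokens = decode_speech([[1.0], [2.0]], toy_head, k=5)
print(len(speech_tokens))
```

The point of the sketch is the shape of the loop: the LLM advances once per group, while the per-token work is pushed into the smaller head.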