DrVoice: Parallel Speech-Text Voice Conversation Model via Dual-Resolution Speech Representations
Chao-Hong Tan, Qian Chen, Wen Wang, Chong Deng, Qinglin Zhang, Luyao Cheng, Hai Yu, Xin Zhang, Xiang Lv, Tianyu Zhao, Chong Zhang, Yukun Ma, Yafeng Chen, Hui Wang, Jiaqing Liu, Xiangang Li, Jieping Ye
TL;DR
DrVoice tackles efficient end-to-end joint speech-text generation by introducing Dual-Resolution Speech Representations, which group speech representations with a grouping factor $k=5$ so that the LLM receives input at 5Hz. The model employs a Shared LLM Layer with a Text Head and a Speech Refined Head (SRH) to generate text and ungrouped speech tokens in parallel, guided by CoM reasoning and a two-stage Core-Cocktail training regime that preserves the base LLM's knowledge. The approach achieves state-of-the-art results on OpenAudioBench, VoiceBench, UltraEval-Audio, and Big Bench Audio among models of roughly 7B parameters, while reducing training GPU hours by about 50%. Key findings show that CSE, SRH, and the data curriculum (CoM-Mixing) are essential for balancing speech understanding and generation with text capabilities, and that data quality significantly influences real-world performance. Overall, DrVoice advances open-source speech foundation modeling by delivering strong cross-modal coherence, efficiency, and scalability for practical voice conversation systems.
Abstract
Recent studies on end-to-end (E2E) speech generation with large language models (LLMs) have attracted significant community attention, with multiple works extending text-based LLMs to generate discrete speech tokens. Existing E2E approaches primarily fall into two categories: (1) methods that generate discrete speech tokens independently, without incorporating them into the LLM's autoregressive process, so that text generation is unaware of the concurrent speech synthesis; and (2) models that generate interleaved or parallel speech-text tokens through joint autoregressive modeling, enabling mutual modality awareness during generation. This paper presents DrVoice, a parallel speech-text voice conversation model based on joint autoregressive modeling, featuring dual-resolution speech representations. Notably, while current methods mainly use a 12.5Hz input audio representation, our proposed dual-resolution mechanism reduces the input frequency for the LLM to 5Hz. This significantly reduces computational cost and narrows the frequency discrepancy between speech and text tokens, which in turn better exploits the LLM's capabilities. Experimental results demonstrate that DrVoice-7B establishes a new state of the art (SOTA) on prominent speech benchmarks, including OpenAudioBench, VoiceBench, UltraEval-Audio, and Big Bench Audio, making it a leading open-source speech foundation model among ~7B models.
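To make the rate reduction concrete, the following is a minimal sketch of grouping with factor $k=5$. It is not the authors' implementation. It assumes that grouping means concatenating $k$ consecutive frame vectors (the actual model may apply a learned projection afterwards), that the tail is zero-padded, and that the ungrouped token rate is 25Hz, a value inferred here only from $k=5$ and the 5Hz LLM input rate. The names `group_frames` and `grouped_rate_hz` are hypothetical.

```python
from typing import List

Frame = List[float]


def group_frames(frames: List[Frame], k: int = 5) -> List[Frame]:
    """Concatenate every k consecutive frames into one grouped frame.

    The tail is zero-padded so the length becomes a multiple of k
    (the padding strategy is an assumption of this sketch).
    """
    if k <= 0:
        raise ValueError("k must be positive")
    if not frames:
        return []
    dim = len(frames[0])
    padded = list(frames)
    remainder = len(padded) % k
    if remainder:
        padded.extend([[0.0] * dim for _ in range(k - remainder)])
    grouped: List[Frame] = []
    for start in range(0, len(padded), k):
        merged: Frame = []
        for frame in padded[start:start + k]:
            merged.extend(frame)
        grouped.append(merged)
    return grouped


def grouped_rate_hz(token_rate_hz: float, k: int = 5) -> float:
    """Frame rate seen by the LLM after grouping k frames into one."""
    return token_rate_hz / k


# Two seconds of hypothetical 25 Hz frames, each a 4-dimensional vector.
frames = [[float(t)] * 4 for t in range(50)]
grouped = group_frames(frames, k=5)

print(len(frames), "->", len(grouped))  # 50 -> 10
print(len(grouped[0]))                  # 20
print(grouped_rate_hz(25.0, 5))         # 5.0
```

Under these assumptions, the LLM processes one fifth as many speech positions per second, while each grouped position still carries the information of all five original frames for a downstream head to ungroup.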
