Auden-Voice: General-Purpose Voice Encoder for Speech and Language Understanding
Mingyue Huo, Wei-Cheng Tseng, Yiwen Shao, Hao Zhang, Dong Yu
TL;DR
The paper addresses a bottleneck in large audio–language models (LALMs): acoustic encoders struggle to balance speaker identity with paralinguistic cues. It systematically evaluates initialization strategies and training regimes, comparing multi-task learning with CLAP-style speech–text alignment on a Zipformer-based encoder, and assesses integration with LLMs via LLM-QA. The findings show that ASR-based initialization combined with multi-task supervision yields the most balanced representations across identity and paralinguistic tasks, while CLAP improves retrieval but can degrade other capabilities. LLM-QA results correlate with linear-probing results, and the multi-task encoder delivers the strongest performance. The result is Auden-Voice, a general-purpose, LLM-friendly voice encoder that balances identity and paralinguistic cues and can be integrated into the Auden audio understanding toolkit for practical speech understanding and reasoning tasks.
Abstract
Human voice encodes both identity and paralinguistic cues, yet encoders in large audio-language models (LALMs) rarely balance both aspects. In this work, we present a study toward building a general-purpose voice encoder that captures nuanced voice cues. Through a comprehensive evaluation, we find that multi-task training yields the most balanced representations, whereas contrastive language-audio pretraining (CLAP) primarily improves retrieval without enhancing paralinguistic understanding. Our final encoder, Auden-Voice, also demonstrates strong performance when integrated with LLMs. The code and training recipes will be released with the audio understanding toolkit Auden.
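For readers unfamiliar with the CLAP objective mentioned above, the following is a minimal sketch of a symmetric contrastive loss over paired speech and text embeddings. It is an illustration under stated assumptions, not the training code of Auden-Voice: the function name, the temperature of 0.07, and the use of NumPy are all choices made here for clarity, and the paper's actual loss may differ in detail.

```python
import numpy as np


def symmetric_contrastive_loss(speech_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE-style loss for a batch of paired embeddings.

    Hypothetical helper for illustration only. Row i of `speech_emb` is
    assumed to be paired with row i of `text_emb`; all other rows in the
    batch act as negatives. The temperature value is an assumption.
    """
    # L2-normalize so the dot product is a cosine similarity.
    s = speech_emb / np.linalg.norm(speech_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)

    # Pairwise similarity matrix, scaled by the temperature.
    logits = (s @ t.T) / temperature

    def diagonal_cross_entropy(scores):
        # Numerically stable log-softmax over each row.
        shifted = scores - scores.max(axis=1, keepdims=True)
        log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
        # The correct match for row i is column i.
        return -np.mean(np.diag(log_probs))

    # Average the speech-to-text and text-to-speech directions.
    speech_to_text = diagonal_cross_entropy(logits)
    text_to_speech = diagonal_cross_entropy(logits.T)
    return 0.5 * (speech_to_text + text_to_speech)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    speech = rng.normal(size=(4, 8))
    aligned_text = speech.copy()                     # perfectly matched pairs
    shuffled_text = np.roll(speech, shift=1, axis=0)  # every pair mismatched
    print("aligned :", symmetric_contrastive_loss(speech, aligned_text))
    print("shuffled:", symmetric_contrastive_loss(speech, shuffled_text))
```

Because this objective only rewards matching each utterance to its own description within a batch, it directly optimizes retrieval, which is consistent with the abstract's observation that CLAP improves retrieval without necessarily improving paralinguistic understanding.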
