Revisiting Audio-language Pretraining for Learning General-purpose Audio Representation
Wei-Cheng Tseng, Xuanru Zhou, Mingyue Huo, Yiwen Shao, Hao Zhang, Dong Yu
TL;DR
The paper addresses the challenge of learning general-purpose audio representations through audio-language pretraining, which has lagged behind vision-language progress. It introduces CaptionStew, a large and diverse dataset of 9.3M audio samples with 10.7M captions (37,290 hours), and conducts the first comprehensive cross-task evaluation comparing contrastive and captioning objectives across speech, music, and environmental sounds. Key findings reveal that contrastive learning offers data efficiency for discriminative tasks, while captioning scales better for language-involved audio understanding, and that the benefits of supervised initialization diminish at scale. Collectively, these results establish audio-language pretraining as a viable path toward universal audio representations. The work also provides reproducible pipelines, data preparation recipes, and pretrained models. The analysis highlights the value of diverse caption sources and suggests avenues for improving caption diversity to further boost representation quality.
Abstract
Audio-language pretraining holds promise for general-purpose audio understanding, yet remains underexplored compared to its vision counterpart. While vision-language models like CLIP serve as widely adopted foundations, existing audio-language models primarily excel at retrieval tasks and see limited adoption as general-purpose encoders. We identify three key barriers: limited large-scale audio-text corpora, insufficient caption diversity, and a lack of systematic exploration and evaluation. To this end, we introduce CaptionStew, a dataset of 10.7M captions that aggregates diverse open-source audio-text corpora across multiple domains and captioning styles. Using this resource, we conduct the first comprehensive evaluation comparing contrastive and captioning objectives for audio representation learning across speech, music, and environmental sound tasks. Our results demonstrate that audio-language pretraining yields competitive, transferable representations. Through systematic data-scaling experiments, we reveal complementary objective strengths: contrastive learning achieves superior data efficiency at smaller scales, while captioning demonstrates better scalability on language-involved audio understanding tasks. We also find that common supervised initialization practices provide diminishing returns at scale, challenging current approaches. These findings establish audio-language pretraining as a viable pathway toward general-purpose audio representations and offer guidance for future research. To accelerate progress, we release data preparation recipes, training protocols, and pretrained models, paving the way toward universal audio understanding.
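To make the two pretraining objectives compared above concrete, here is a minimal sketch in plain Python. The abstract does not spell out the loss formulations, so this assumes a CLIP-style symmetric InfoNCE loss for the contrastive objective and a teacher-forced token-level cross-entropy for the captioning objective. The function names, the toy inputs, and the temperature of 0.07 are illustrative assumptions, not details taken from the paper.

```python
import math

def log_softmax(row):
    # Numerically stable log-softmax over a list of floats.
    m = max(row)
    lse = m + math.log(sum(math.exp(v - m) for v in row))
    return [v - lse for v in row]

def l2_normalize(vec):
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]

def contrastive_loss(audio_embs, text_embs, temperature=0.07):
    # Assumed CLIP-style objective: the i-th audio clip and the i-th caption
    # form the positive pair; every other pairing in the batch is a negative.
    a = [l2_normalize(v) for v in audio_embs]
    t = [l2_normalize(v) for v in text_embs]
    n = len(a)
    logits = [
        [sum(x * y for x, y in zip(a[i], t[j])) / temperature for j in range(n)]
        for i in range(n)
    ]
    # Audio-to-text direction: softmax over each row.
    a2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Text-to-audio direction: softmax over each column.
    cols = [[logits[i][j] for i in range(n)] for j in range(n)]
    t2a = -sum(log_softmax(cols[j])[j] for j in range(n)) / n
    return 0.5 * (a2t + t2a)

def captioning_loss(token_logits, target_ids):
    # Assumed teacher-forced objective: mean negative log-likelihood of the
    # reference caption tokens. token_logits holds one list of vocabulary
    # logits per decoding step (hypothetically produced by a text decoder
    # conditioned on the audio encoder's output).
    nll = [-log_softmax(step)[tok] for step, tok in zip(token_logits, target_ids)]
    return sum(nll) / len(nll)

# Toy usage: four perfectly aligned audio/text embedding pairs.
eye = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
aligned = contrastive_loss(eye, eye)
misaligned = contrastive_loss(eye, eye[::-1])
uniform_caption = captioning_loss([[0.0] * 5 for _ in range(3)], [0, 1, 2])
```

In this sketch the contrastive loss depends only on a pooled embedding per clip, whereas the captioning loss supervises every caption token. That difference is one plausible reading of why the two objectives behave differently as data scales, though the abstract itself does not attribute the result to any specific mechanism.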
