End-to-end Contrastive Language-Speech Pretraining Model For Long-form Spoken Question Answering
Jiliang Hu, Zuchao Li, Baoyuan Qi, Liu Guoming, Ping Wang
TL;DR
The paper tackles long-form spoken question answering (SQA) by introducing CLSR, an end-to-end contrastive language-speech retriever that distills lengthy audio into a few question-relevant clips. It does so by converting acoustic features into text-like representations through Continuous Integrate-and-Fire (CIF) and a vector-quantized (VQ) adaptor, then aligning these representations with question embeddings produced by a text encoder. CLSR outperforms conventional end-to-end speech-text retrievers and approaches or surpasses pipeline and text-based retrievers across four datasets, while enabling large audio language models (LALMs) to process long contexts efficiently. A dedicated long-form SQA evaluation shows substantial speedups and accuracy gains when CLSR serves as a preprocessing step for LALMs such as Qwen-Audio, highlighting its practical value for real-world long-audio QA tasks.
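To illustrate the CIF step mentioned above, here is a minimal Python sketch of the general integrate-and-fire idea. It is not the authors' implementation. It assumes each acoustic frame comes with a weight in [0, 1] (in a real model these weights would be predicted), accumulates the weighted frames, and emits one text-like token embedding each time the accumulated weight reaches a threshold. The function name `cif`, the threshold of 1.0, and the toy inputs are assumptions made for illustration.

```python
def cif(frames, alphas, threshold=1.0):
    """Toy Continuous Integrate-and-Fire over a list of frame vectors.

    frames: list of equal-length float lists (acoustic features).
    alphas: per-frame weights, assumed to lie in [0, 1].
    Returns one integrated vector per "fired" token.
    """
    dim = len(frames[0])
    tokens = []
    acc_w = 0.0            # weight accumulated since the last fire
    acc_v = [0.0] * dim    # weighted sum of frames since the last fire
    for h, a in zip(frames, alphas):
        if acc_w + a < threshold:
            # Not enough weight yet: keep integrating.
            acc_w += a
            acc_v = [v + a * x for v, x in zip(acc_v, h)]
        else:
            # Use just enough of this frame's weight to reach the threshold,
            # fire a token, and carry the remainder into the next token.
            used = threshold - acc_w
            tokens.append([v + used * x for v, x in zip(acc_v, h)])
            rest = a - used
            acc_w = rest
            acc_v = [rest * x for x in h]
    return tokens


# Four frames with weight 0.5 each collapse into two token-level vectors.
toy_frames = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
toy_alphas = [0.5, 0.5, 0.5, 0.5]
print(cif(toy_frames, toy_alphas))  # [[1.0, 0.0], [0.0, 1.0]]
```

The point of the sketch is the length reduction: many frames become a few token-like vectors, which is what makes the output easier to align with text. Any weight left over at the end of the input is simply dropped here, a simplification of this toy version.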
Abstract
Significant progress has been made in spoken question answering (SQA) in recent years. However, many existing methods, including large audio language models, struggle to process long audio. Following the success of retrieval-augmented generation, a speech-related retriever shows promise for preprocessing long-form speech, but the performance of existing speech-related retrievers is lacking. To address this challenge, we propose CLSR, an end-to-end contrastive language-speech retriever that efficiently extracts question-relevant segments from long audio recordings for downstream SQA tasks. Unlike conventional speech-text contrastive models, CLSR incorporates an intermediate step that converts acoustic features into text-like representations prior to alignment, thereby bridging the gap between modalities more effectively. Experimental results across four cross-modal retrieval datasets demonstrate that CLSR surpasses both end-to-end speech-related retrievers and pipeline approaches combining speech recognition with text retrieval, providing a robust foundation for advancing practical long-form SQA applications.
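The retrieval step described above can be pictured with a small Python sketch. It assumes the question and each audio clip have already been encoded into vectors in a shared embedding space; the actual encoders are not shown. Clips are ranked by cosine similarity to the question and only the top few are kept for the downstream model. The names `cosine` and `top_k_clips`, the value of `k`, and the toy vectors are hypothetical and chosen only for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def top_k_clips(question_emb, clip_embs, k=2):
    """Return indices of the k clips most similar to the question."""
    ranked = sorted(
        range(len(clip_embs)),
        key=lambda i: cosine(question_emb, clip_embs[i]),
        reverse=True,
    )
    return ranked[:k]


# Toy example: one question embedding and four clip embeddings.
question = [1.0, 0.0]
clips = [[0.0, 1.0], [1.0, 0.1], [0.9, 0.5], [-1.0, 0.0]]
print(top_k_clips(question, clips, k=2))  # [1, 2]
```

Only the selected clips would then be passed to the question-answering model, which is how a retriever of this kind shortens the audio context.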
