Language-based Audio Retrieval with Co-Attention Networks

Haoran Sun, Zimu Wang, Qiuyi Chen, Jianjun Chen, Jia Wang, Haiyang Zhang

TL;DR

This work tackles language-based audio retrieval, where natural language queries are used to find relevant audio clips across heterogeneous text and audio modalities. It introduces a cascaded co-attention framework that jointly attends to text words and audio segments. The framework comprises a single co-attention module and two deeper designs, stacking and iterating, which progressively refine cross-modal alignment. Audio and text are encoded with CLAP and RoBERTa, respectively, and contrastive learning with the NT-Xent loss guides the retrieval objective; the approach is further enhanced by ChatGPT-based caption augmentation. Results on Clotho and AudioCaps show substantial gains over the baselines and the state of the art, with the iterating co-attention model improving mean Average Precision by up to 16.6% on Clotho and 15.1% on AudioCaps, underscoring the value of deep cross-modal fusion for practical audio search applications.

Abstract

In recent years, user-generated audio content has proliferated across various media platforms, creating a growing need for efficient retrieval methods that allow users to search for audio clips using natural language queries. This task, known as language-based audio retrieval, presents significant challenges due to the complexity of learning semantic representations from heterogeneous data across both text and audio modalities. In this work, we introduce a novel framework for the language-based audio retrieval task that leverages a co-attention mechanism to jointly learn meaningful representations from both modalities. To enhance the model's ability to capture fine-grained cross-modal interactions, we propose a cascaded co-attention architecture, where co-attention modules are stacked or iterated to progressively refine the semantic alignment between text and audio. Experiments conducted on two public datasets show that the proposed method achieves better performance than the state-of-the-art method. Specifically, our best-performing co-attention model achieves a 16.6% improvement in mean Average Precision on the Clotho dataset, and a 15.1% improvement on AudioCaps.
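
The TL;DR names NT-Xent as the contrastive objective that guides retrieval. The following is a minimal NumPy sketch of a symmetric NT-Xent loss over a batch of paired audio and text embeddings, not the authors' implementation. The function names, the temperature value of 0.07, and the averaging of the two retrieval directions are assumptions made for illustration.

```python
import numpy as np


def _log_softmax(x, axis):
    # Numerically stable log-softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))


def nt_xent_loss(audio_emb, text_emb, temperature=0.07):
    """Symmetric NT-Xent loss for a batch of paired embeddings.

    Row i of audio_emb and row i of text_emb form the positive pair;
    every other row in the batch acts as a negative.
    The temperature default is an assumed value, not taken from the paper.
    """
    # L2-normalise so that the dot product equals cosine similarity.
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)

    logits = a @ t.T / temperature  # shape: (batch, batch)
    idx = np.arange(len(a))

    # Audio-to-text: each row is a softmax over candidate captions.
    loss_a2t = -_log_softmax(logits, axis=1)[idx, idx].mean()
    # Text-to-audio: each column is a softmax over candidate clips.
    loss_t2a = -_log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_a2t + loss_t2a)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    emb = rng.normal(size=(4, 8))
    print("aligned pairs:   ", nt_xent_loss(emb, emb))
    print("mismatched pairs:", nt_xent_loss(emb, emb[::-1]))
```

With correctly paired embeddings the diagonal of the similarity matrix dominates and the loss is small; reversing the order of one modality mismatches every pair and the loss grows.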

Paper Structure (15 sections, 13 equations, 4 figures, 3 tables)

Figures (4)

  • Figure 1: Brief illustration of our proposed model, which includes a GPT-based generation component for text augmentation, self-attention components for both audio and text, a co-attention module, and a model training and fusion component.
  • Figure 2: Brief illustration of the single co-attention module, which includes two self-attention modules and a guided-attention component.
  • Figure 3: Brief illustration of the guided-attention components for both modalities.
  • Figure 4: The structure of the stacking and iterating modules. The stacking module simply stacks the attention components, while the iterating module trains the network hierarchically, first iterating on the embedding that requires only self-attention.
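
The captions for Figures 2 and 3 describe a co-attention module built from two self-attention modules and guided attention for both modalities. The sketch below shows one way such a module could be wired using plain scaled dot-product attention in NumPy. It is an illustration under stated assumptions rather than the paper's architecture: learned projections, multiple heads, residual connections, and feed-forward layers are all omitted, and the names `attention` and `co_attention` are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention(queries, keys, values):
    """Scaled dot-product attention without learned projections."""
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)  # (n_queries, n_keys)
    return softmax(scores, axis=-1) @ values


def co_attention(text, audio):
    """One simplified co-attention step (assumed wiring).

    text:  (n_words, d) word embeddings
    audio: (n_segments, d) audio segment embeddings
    """
    # Self-attention: each modality attends to itself.
    text_sa = attention(text, text, text)
    audio_sa = attention(audio, audio, audio)
    # Guided attention: each modality queries the other one.
    text_ga = attention(text_sa, audio_sa, audio_sa)
    audio_ga = attention(audio_sa, text_sa, text_sa)
    return text_ga, audio_ga


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    text = rng.normal(size=(5, 16))   # 5 words
    audio = rng.normal(size=(7, 16))  # 7 audio segments
    # A cascade in the spirit of stacking: apply the module repeatedly.
    for _ in range(2):
        text, audio = co_attention(text, audio)
    print(text.shape, audio.shape)
```

Each modality keeps its own sequence length while its features are rebuilt from the other modality, which is why the module can be applied repeatedly in a cascade.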