Improving Audio-Text Retrieval via Hierarchical Cross-Modal Interaction and Auxiliary Captions

Yifei Xin, Yuexian Zou

TL;DR

This work tackles audio-text retrieval by introducing hierarchical cross-modal interaction (HCI) to jointly model clip-sentence, segment-phrase, and frame-word relations, enabling fine-grained cross-modal alignment. It further proposes an auxiliary captions (AC) framework that leverages captions generated by a pretrained captioner for data augmentation, audio-caption feature interaction, and complementary text-caption matching, providing robust improvements beyond standard audio-text matching. Empirical results on AudioCaps and Clotho demonstrate consistent gains from HCI across encoders, while AC yields large gains that are further amplified when combined with HCI. The approach delivers a scalable, multi-granularity cross-modal framework with practical impact for retrieval tasks where fine-grained alignment and additional textual cues are available.

Abstract

Most existing audio-text retrieval (ATR) methods construct contrastive pairs between whole audio clips and complete caption sentences, ignoring fine-grained cross-modal relationships such as those between short segments and phrases or between frames and words. In this paper, we introduce a hierarchical cross-modal interaction (HCI) method for ATR that simultaneously explores clip-sentence, segment-phrase, and frame-word relationships, achieving a comprehensive multi-modal semantic comparison. We also present a novel ATR framework that leverages auxiliary captions (AC) generated by a pretrained captioner to perform feature interaction between audio and generated captions, which yields enhanced audio representations and is complementary to the original ATR matching branch. The audio and generated captions can also form new audio-text pairs as data augmentation for training. Experiments show that HCI significantly improves ATR performance, and that the AC framework yields stable performance gains on multiple datasets.
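
To make the three granularities concrete, the following is a minimal sketch of how clip-sentence, segment-phrase, and frame-word similarities could be computed and combined. The abstract does not specify how segments and phrases are formed, how each level is scored, or how the levels are fused, so everything here is an assumption chosen for illustration: fixed-window mean pooling, a symmetric max-mean cosine score, equal weighting, and all function and parameter names (`pool_windows`, `level_similarity`, `hierarchical_similarity`, `seg_window`, `phrase_window`) are hypothetical rather than the authors' implementation.

```python
import numpy as np


def l2_normalize(x, eps=1e-8):
    """Scale each row of x to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def pool_windows(x, window):
    """Mean-pool consecutive rows of x (T, D) into ceil(T / window) rows.

    Assumption: coarser units (segments, phrases) are fixed-size windows.
    """
    chunks = [x[i:i + window].mean(axis=0) for i in range(0, len(x), window)]
    return np.stack(chunks)


def level_similarity(audio_units, text_units):
    """Symmetric max-mean cosine similarity between two sets of vectors.

    Assumption: each unit is matched to its best counterpart in the other
    modality, and the two directions are averaged.
    """
    sim = l2_normalize(audio_units) @ l2_normalize(text_units).T  # (Na, Nt)
    audio_to_text = sim.max(axis=1).mean()
    text_to_audio = sim.max(axis=0).mean()
    return 0.5 * (audio_to_text + text_to_audio)


def hierarchical_similarity(frames, words, seg_window=4, phrase_window=2):
    """Combine clip-sentence, segment-phrase, and frame-word scores.

    frames: (T, D) frame-level audio features.
    words:  (L, D) word-level text features in the same embedding space.
    Assumption: the three levels are averaged with equal weight.
    """
    clip = frames.mean(axis=0, keepdims=True)
    sentence = words.mean(axis=0, keepdims=True)
    segments = pool_windows(frames, seg_window)
    phrases = pool_windows(words, phrase_window)

    s_clip = level_similarity(clip, sentence)
    s_segment = level_similarity(segments, phrases)
    s_frame = level_similarity(frames, words)
    return (s_clip + s_segment + s_frame) / 3.0


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    frames = rng.normal(size=(16, 8))  # 16 audio frames, 8-dim features
    words = rng.normal(size=(6, 8))    # 6 words, 8-dim features
    print(hierarchical_similarity(frames, words))
```

In a retrieval setting, a score like this would be evaluated for every audio-text pair in a batch and fed to a contrastive loss; that step is omitted here because the source text does not describe the training objective.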

Paper Structure

This paper contains 15 sections, 11 equations, 2 figures, and 5 tables.

Figures (2)

  • Figure 1: The overview of our hierarchical cross-modal interaction method for ATR.
  • Figure 2: The overview of our auxiliary captions (AC) framework for ATR.