Audio-text Retrieval with Transformer-based Hierarchical Alignment and Disentangled Cross-modal Representation

Yifei Xin; Zhihong Zhu; Xuxin Cheng; Xusheng Yang; Yuexian Zou

TL;DR

This work tackles audio-text retrieval by moving beyond single-level alignment to a Transformer-based hierarchical approach (THA) that leverages multi-level correspondences between audio and text. It introduces Disentangled Cross-modal Representation (DCR) to factor high-dimensional features into compact latent elements and uses a Confidence-Aware (CA) module to weight factor-level alignments, optimizing a contrastive objective that combines global and local signals: $L = L_S + \alpha L_D + \beta L_A$. The dual-stream architecture aligns audio and text through both hierarchical (THA) and disentangled (DCR) pathways, yielding consistent gains on AudioCaps and Clotho, with THA+DCR achieving the best results. These advances enable finer-grained cross-modal matching and more robust ATR performance, with practical potential for improved search and retrieval across audio and textual modalities.

Abstract

Most existing audio-text retrieval (ATR) approaches rely on a single-level interaction to associate audio and text, which limits their ability to align the two modalities and leads to suboptimal matches. In this work, we present a novel ATR framework that combines two-stream Transformers with a Hierarchical Alignment (THA) module to identify multi-level correspondences between audio and text across different Transformer blocks. Moreover, current ATR methods mainly focus on learning a global-level representation and therefore miss the fine details needed to capture the audio events that correspond to textual semantics. To bridge this gap, we introduce a Disentangled Cross-modal Representation (DCR) approach that disentangles high-dimensional features into compact latent factors to capture fine-grained audio-text semantic correlations. Additionally, we develop a confidence-aware (CA) module that estimates the confidence of each latent factor pair and adaptively aggregates cross-modal latent factors to achieve local semantic alignment. Experiments show that THA effectively boosts ATR performance, and that DCR contributes further consistent gains.
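
To make the idea of factor-level alignment with confidence weighting concrete, the following is a minimal sketch and not the authors' implementation. It assumes that an embedding is disentangled by simply splitting it into K equal chunks, that each audio-text factor pair is scored by cosine similarity, and that the confidence of a pair is a softmax over those scores. In the paper the CA module estimates confidence itself; the function names, the splitting scheme, and the `temperature` parameter here are all hypothetical.

```python
import math

def split_factors(vec, k):
    """Split a flat embedding into k equal-sized latent factors (assumed scheme)."""
    if len(vec) % k != 0:
        raise ValueError("embedding length must be divisible by k")
    size = len(vec) // k
    return [vec[i * size:(i + 1) * size] for i in range(k)]

def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0

def softmax(xs, temperature=1.0):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp((x - m) / temperature) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def aggregated_similarity(audio_vec, text_vec, k, temperature=0.1):
    """Confidence-weighted sum of per-factor similarities.

    Assumption: confidence is a softmax over the factor similarities,
    standing in for the confidence estimate of the CA module.
    """
    audio_factors = split_factors(audio_vec, k)
    text_factors = split_factors(text_vec, k)
    sims = [cosine(a, t) for a, t in zip(audio_factors, text_factors)]
    weights = softmax(sims, temperature)
    score = sum(w * s for w, s in zip(weights, sims))
    return score, sims, weights

# Hypothetical 4-dimensional embeddings split into 2 factors:
# the first factor pair agrees, the second disagrees.
score, sims, weights = aggregated_similarity([1, 0, 0, 1], [1, 0, 0, -1], k=2)
```

With this weighting, the agreeing factor pair dominates the final score and the disagreeing pair is nearly ignored, which is the intuition behind aggregating latent factors adaptively rather than averaging them uniformly.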

Paper Structure (11 sections, 5 equations, 2 figures, 5 tables)

Introduction
Proposed Methods
Dual-stream Transformers
Hierarchical Alignment Module
Disentangled Cross-modal Representation from both Inter-factor and Intra-factor Perspectives
Adaptive Aggregation for Similarity Calculation
Experiments
Dataset and Experiment Details
Experimental Results
Ablation Study
Conclusions

Figures (2)

Figure 1: Illustration of our Transformer-based hierarchical alignment module. We take one-stage interaction as an example.
Figure 2: Overview of the disentangled cross-modal representation (DCR) framework.
