Audio-text Retrieval with Transformer-based Hierarchical Alignment and Disentangled Cross-modal Representation

Yifei Xin, Zhihong Zhu, Xuxin Cheng, Xusheng Yang, Yuexian Zou

TL;DR

This work tackles audio-text retrieval (ATR) by moving beyond single-level alignment to a Transformer-based Hierarchical Alignment (THA) approach that exploits multi-level correspondences between audio and text. It introduces a Disentangled Cross-modal Representation (DCR) that factors high-dimensional features into compact latent factors, and a Confidence-Aware (CA) module that weights factor-level alignments. Training optimizes a contrastive objective that combines global and local signals: $L = L_S + \alpha L_D + \beta L_A$. The two-stream architecture aligns audio and text through both the hierarchical (THA) and disentangled (DCR) pathways, yielding consistent gains on AudioCaps and Clotho, with THA+DCR achieving the best results. The finer-grained cross-modal matching makes ATR more robust, with practical potential for search and retrieval across audio and text.
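As a rough illustration of how a weighted objective of the form $L = L_S + \alpha L_D + \beta L_A$ could be assembled, the sketch below computes a symmetric contrastive loss over an audio-text similarity matrix and then combines three loss terms. The summary does not define the individual terms, so the function names (`contrastive_loss`, `total_loss`), the temperature value, and the default weights are assumptions made for illustration, not the authors' implementation.

```python
import math


def contrastive_loss(sim, temperature=0.07):
    """Symmetric contrastive loss over a square similarity matrix.

    sim[i][j] is the similarity between audio i and text j; matched
    pairs are assumed to lie on the diagonal. The temperature value is
    an illustrative assumption.
    """
    n = len(sim)

    def directional(rows):
        total = 0.0
        for i, row in enumerate(rows):
            logits = [s / temperature for s in row]
            m = max(logits)  # subtract the max for numerical stability
            log_z = m + math.log(sum(math.exp(x - m) for x in logits))
            total += log_z - logits[i]  # negative log-softmax of the match
        return total / n

    columns = [list(col) for col in zip(*sim)]
    # Average the audio-to-text and text-to-audio directions.
    return 0.5 * (directional(sim) + directional(columns))


def total_loss(l_s, l_d, l_a, alpha=1.0, beta=1.0):
    """Weighted sum L = L_S + alpha * L_D + beta * L_A (weights are placeholders)."""
    return l_s + alpha * l_d + beta * l_a
```

In this sketch, a similarity matrix whose largest entries sit on the diagonal gives a lower loss than one whose matches are scrambled, which is the behaviour a retrieval objective is meant to reward.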

Abstract

Most existing audio-text retrieval (ATR) approaches rely on a single-level interaction to associate audio and text, which limits their ability to align the two modalities and leads to suboptimal matches. In this work, we present a novel ATR framework that leverages two-stream Transformers in conjunction with a Hierarchical Alignment (THA) module to identify multi-level correspondences between audio and text across different Transformer blocks. Moreover, current ATR methods mainly focus on learning a global-level representation and therefore miss the fine details needed to capture the audio events that correspond to textual semantics. To bridge this gap, we introduce a Disentangled Cross-modal Representation (DCR) approach that disentangles high-dimensional features into compact latent factors to capture fine-grained audio-text semantic correlations. Additionally, we develop a confidence-aware (CA) module that estimates the confidence of each latent factor pair and adaptively aggregates cross-modal latent factors to achieve local semantic alignment. Experiments show that our THA effectively boosts ATR performance, with the DCR approach contributing further consistent gains.
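The idea of disentangling features into latent factors and aggregating them by confidence can be sketched in a few lines. The example below is a minimal sketch under stated assumptions: it splits each embedding into equal-sized chunks as stand-ins for latent factors, scores each factor pair with cosine similarity, and weights the scores with a softmax over per-factor confidence logits. The chunking scheme, the softmax weighting, and all names (`split_factors`, `local_similarity`, `confidence_logits`) are hypothetical; the paper's actual DCR and CA modules may differ, and in practice the confidence values would be predicted by a learned module rather than passed in by hand.

```python
import math


def split_factors(vec, k):
    """Split a feature vector into k equal-sized chunks (stand-ins for latent factors)."""
    if k <= 0 or len(vec) % k != 0:
        raise ValueError("vector length must be divisible by a positive k")
    size = len(vec) // k
    return [vec[i * size:(i + 1) * size] for i in range(k)]


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]


def local_similarity(audio_vec, text_vec, k, confidence_logits):
    """Confidence-weighted sum of per-factor cosine similarities.

    confidence_logits holds one score per factor pair and is supplied
    directly here purely for illustration.
    """
    if len(confidence_logits) != k:
        raise ValueError("need exactly one confidence logit per factor")
    audio_factors = split_factors(audio_vec, k)
    text_factors = split_factors(text_vec, k)
    sims = [cosine(a, t) for a, t in zip(audio_factors, text_factors)]
    weights = softmax(confidence_logits)
    return sum(w * s for w, s in zip(weights, sims))
```

With `audio_vec = [1, 0, 0, 1]`, `text_vec = [1, 0, 1, 0]`, and `k = 2`, the first factor pair matches exactly and the second is orthogonal, so raising the confidence of the first factor pulls the aggregated score toward that matching pair.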

Paper Structure

This paper contains 11 sections, 5 equations, 2 figures, and 5 tables.

Figures (2)

  • Figure 1: Illustration of our Transformer-based hierarchical alignment module. We take one-stage interaction as an example.
  • Figure 2: Overview of the disentangled cross-modal representation (DCR) framework.