Anchor-aware Deep Metric Learning for Audio-visual Retrieval

Donghuo Zeng, Yanan Wang, Kazushi Ikeda, Yi Yu

TL;DR

This work tackles audio-visual cross-modal retrieval under limited training data by introducing Anchor-Aware Deep Metric Learning (AADML). It builds a correlation-graph-based manifold for each modality and derives anchor-aware proxies through attention mechanisms, capturing intra- and inter-modal dependencies. These AA proxies are integrated into standard metric-learning losses (e.g., triplet and contrastive losses) to produce richer, more discriminative embeddings. Experiments on VEGAS and AVE show state-of-the-art MAP performance and demonstrate that AA proxies enhance a range of metric-learning losses, indicating robustness and potential for broader cross-modal retrieval tasks.

Abstract

Metric learning minimizes the gap between similar (positive) pairs of data points and increases the separation of dissimilar (negative) pairs, aiming at capturing the underlying data structure and enhancing the performance of tasks like audio-visual cross-modal retrieval (AV-CMR). Recent works employ sampling methods to select impactful data points from the embedding space during training. However, the model training fails to fully explore the space due to the scarcity of training data points, resulting in an incomplete representation of the overall positive and negative distributions. In this paper, we propose an innovative Anchor-aware Deep Metric Learning (AADML) method to address this challenge by uncovering the underlying correlations among existing data points, which enhances the quality of the shared embedding space. Specifically, our method establishes a correlation graph-based manifold structure by considering the dependencies between each sample as the anchor and its semantically similar samples. Through dynamic weighting of the correlations within this underlying manifold structure using an attention-driven mechanism, Anchor Awareness (AA) scores are obtained for each anchor. These AA scores serve as data proxies to compute relative distances in metric learning approaches. Extensive experiments conducted on two audio-visual benchmark datasets demonstrate the effectiveness of our proposed AADML method, significantly surpassing state-of-the-art models. Furthermore, we investigate the integration of AA proxies with various metric learning methods, further highlighting the efficacy of our approach.
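
As a point of reference for the pull/push behavior described above, the following is a minimal sketch of two standard metric-learning losses in plain Python. The function names, the Euclidean distance, and the margin value of 1.0 are illustrative assumptions and are not taken from the paper; in AV-CMR the anchor would typically come from one modality (e.g., audio) and the positive and negative from the other.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge on d(anchor, positive) - d(anchor, negative) + margin.

    The loss is zero once the negative is at least `margin` farther
    from the anchor than the positive is.
    """
    gap = euclidean(anchor, positive) - euclidean(anchor, negative) + margin
    return max(0.0, gap)


def contrastive_loss(x1, x2, same_class, margin=1.0):
    """One common form of the pairwise contrastive loss.

    Similar pairs are penalized by their squared distance; dissimilar
    pairs are penalized only while they are closer than `margin`.
    """
    d = euclidean(x1, x2)
    if same_class:
        return d ** 2
    return max(0.0, margin - d) ** 2


# Toy 2-D embeddings (hypothetical values, for illustration only).
easy = triplet_loss([0.0, 0.0], [0.0, 1.0], [3.0, 4.0])  # negative is far away
hard = triplet_loss([0.0, 0.0], [0.0, 1.0], [1.0, 0.0])  # negative as close as positive
print(easy, hard)
```

In the approach summarized in the abstract, the AA score would stand in for the raw anchor embedding when these relative distances are computed.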

Paper Structure

This paper contains 20 sections, 10 equations, 5 figures, 3 tables.

Figures (5)

  • Figure 1: This diagram illustrates the role of the anchor-aware (AA) proxy in deep metric learning. Missing embeddings, caused by a lack of training data points, lead to suboptimal learning of the embedding space. We introduce an AA proxy derived from the correlation graph for each embedding, facilitating the migration toward optimal embedding space learning.
  • Figure 2: The framework of our proposed model. The audio and visual features, extracted by the pre-trained models VGGish and Inception V3, respectively, are projected into the label space as predicted label embeddings. The AADML approach operates within the label space and comprises three components: (I) Choosing an audio sample $P_{a}^{i}$ as the anchor, we traverse the correlation graph to find the $k$ ($k=3$) nearest audio samples ($P_{a}^{k}$ vs. $P_{a}^{l}$) relative to the anchor, thus forming three manifold pairs that serve as key-value pairs for (II). (II) We compute the attention score $A(\cdot)$ with the anchor as the query ($Q_{i}$) and, for each pair, $P_{a}^{\in \{i, j, k\}}$ as the key ($K_{\in \{i, j, k\}}$) and $P_{a}^{i}$ as the value ($V_{i}$). The anchor-aware score $AA(\cdot)$ (pink box) is then obtained as the average of $A(\cdot)$ across the three key-value pairs. (III) This score is subsequently used as an anchor proxy for foundational metric learning methods such as contrastive and triplet loss.
  • Figure 3: Precision-scope@K curves on the VEGAS dataset for audio $\rightarrow$ visual and visual $\rightarrow$ audio retrieval experiments, spanning different values of $K$ from 10 to 1000.
  • Figure 4: MAP trends on the AVE dataset: the AA proxy combined with three distinct triplet methods, as the number of samples selected in AA varies from 1 to 7.
  • Figure 5: Loss values and MAP performance on the training and test sets of the AVE dataset. Comparative analysis of triplet loss variants (Triplet†, Triplet, and Hard Triplet), with and without AA.
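
The three components named in the Figure 2 caption can be sketched roughly as follows. This is one possible reading rather than the authors' implementation: the captions do not give the exact formulas, so the cosine-similarity neighbor search, the scaled outer-product attention, and all function names below are assumptions made for illustration. Only the overall flow (pick the $k=3$ nearest samples to an anchor, attend with the anchor as query and value and each neighbor as key, then average the attention outputs into an AA score) follows the caption.

```python
import math


def cosine(u, v):
    """Cosine similarity (assumed here as the correlation-graph edge weight)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def k_nearest(embeddings, anchor_idx, k=3):
    """(I) Indices of the k samples most similar to the anchor, anchor excluded."""
    anchor = embeddings[anchor_idx]
    others = [j for j in range(len(embeddings)) if j != anchor_idx]
    others.sort(key=lambda j: cosine(anchor, embeddings[j]), reverse=True)
    return others[:k]


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(query, key, value):
    """(II) Assumed form: softmax(q k^T / sqrt(d)) v, with q k^T an outer product."""
    scale = math.sqrt(len(query))
    out = []
    for q in query:
        weights = softmax([q * k_elem / scale for k_elem in key])
        out.append(sum(w * v for w, v in zip(weights, value)))
    return out


def aa_proxy(embeddings, anchor_idx, k=3):
    """Average the attention outputs over the k neighbor keys to get the AA score."""
    anchor = embeddings[anchor_idx]
    neighbours = k_nearest(embeddings, anchor_idx, k)
    scores = [attention(anchor, embeddings[j], anchor) for j in neighbours]
    return [sum(column) / len(scores) for column in zip(*scores)]


# Hypothetical predicted label embeddings for five audio samples, three classes.
P_a = [
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.0],
    [0.7, 0.2, 0.1],
    [0.6, 0.3, 0.1],
    [0.0, 0.1, 0.9],
]
neighbours = k_nearest(P_a, anchor_idx=0, k=3)
proxy = aa_proxy(P_a, anchor_idx=0, k=3)
print(neighbours, proxy)
```

For component (III), the resulting `proxy` would be passed to a contrastive or triplet loss in place of the raw anchor embedding. Because each softmax row sums to one, every entry of the proxy in this sketch is a convex combination of the anchor's own entries, re-weighted by its neighbors.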