Metric Learning with Progressive Self-Distillation for Audio-Visual Embedding Learning

Donghuo Zeng; Kazushi Ikeda

TL;DR

The paper addresses a limitation of label-guided metric learning for audio-visual embeddings: alignment driven only by annotated labels ignores latent cross-modal structure in the data distributions. It introduces a soft cross-modal triplet framework in which a teacher network generates soft alignment labels. Each batch is partitioned into annotated and unannotated instances, and the annotation ratio $r$ shrinks gradually so that training moves toward self-supervision. The method combines a labeled loss $l_{lab}$, a cross-modal triplet loss $l_{cross}$, and a cross-modal decorrelation loss $l_{dis}$ to learn a shared embedding space for audio-visual cross-modal retrieval. Experiments on the AVE and VEGAS datasets report consistent MAP gains over state-of-the-art baselines, supporting the use of soft alignments and progressive self-distillation for multimodal representation learning and for scalable cross-modal retrieval systems.
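The batch partitioning and shrinking annotation ratio described above can be illustrated with a minimal sketch. Everything here is an assumption made for illustration rather than the authors' implementation: the linear schedule, the endpoints `r_start` and `r_end`, the helper names, and the loss weights `w_cross` and `w_dis` do not come from the paper.

```python
def annotation_ratio(epoch, total_epochs, r_start=1.0, r_end=0.2):
    """Shrink the annotated fraction r linearly from r_start to r_end.

    The linear shape and both endpoints are assumptions, not the paper's schedule.
    """
    if total_epochs <= 1:
        return r_end
    # Clamp the epoch index, then map it to a progress value in [0, 1].
    t = min(max(epoch, 0), total_epochs - 1) / (total_epochs - 1)
    return r_start + t * (r_end - r_start)


def partition_batch(batch, r):
    """Split a batch into an annotated head and an unannotated tail.

    At least one instance is always kept in the annotated part.
    """
    n_annotated = max(1, round(r * len(batch)))
    return batch[:n_annotated], batch[n_annotated:]


def total_loss(l_lab, l_cross, l_dis, w_cross=1.0, w_dis=1.0):
    """Combine the three loss terms as a weighted sum (the weights are hypothetical)."""
    return l_lab + w_cross * l_cross + w_dis * l_dis


if __name__ == "__main__":
    batch = list(range(10))
    for epoch in range(5):
        r = annotation_ratio(epoch, total_epochs=5)
        annotated, unannotated = partition_batch(batch, r)
        print(epoch, round(r, 2), len(annotated), len(unannotated))
```

As `r` decreases, fewer instances per batch rely on ground-truth labels and more fall into the unannotated part, which is where teacher-generated soft alignment labels would take over.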

Abstract

Metric learning projects samples into an embedded space, where similarities and dissimilarities are quantified based on their learned representations. However, existing methods often rely on label-guided representation learning, where representations of different modalities, such as audio and visual data, are aligned based on annotated labels. This approach tends to underutilize latent complex features and potential relationships inherent in the distributions of audio and visual data that are not directly tied to the labels, resulting in suboptimal performance in audio-visual embedding learning. To address this issue, we propose a novel architecture that integrates cross-modal triplet loss with progressive self-distillation. Our method enhances representation learning by leveraging inherent distributions and dynamically refining soft audio-visual alignments -- probabilistic alignments between audio and visual data that capture the inherent relationships beyond explicit labels. Specifically, the model distills audio-visual distribution-based knowledge from annotated labels in a subset of each batch. This self-distilled knowledge is used t

Paper Structure (16 sections, 4 equations, 3 figures, 2 tables)

Introduction
Related Work
Audio-visual Embedding Learning
Self-knowledge Distillation
Approach
Preliminaries
Soft Cross-modal Triplet
Progressive Self-distillation
Experiments
Datasets and Metrics
Implementation Settings
Results
Ablation Studies
Triplet Selection Strategy
Impact of Different Components
...and 1 more sections

Figures (3)

Figure 1: The standard triplet loss pulls the positives closer to the anchor ($a_1$) and pushes the negatives away, but it struggles with limited data. AADML [zeng2024anchor] enhances dependencies between similar samples but overlooks the latent information in the distributions beyond labels. Our approach addresses this by generating soft audio-visual alignments, which enhance embeddings for cross-modal tasks such as retrieving baby-crying visuals from audio on the VEGAS dataset [zhou2018visual] and improve the average precision (AP).
Figure 2: The overview of our approach. The batch is divided into annotated instances (light blue background) and unannotated instances (light grey background) in the above three matrices by the hyper-parameter $r$. The teacher network is trained on annotated instances using the cross-modal triplet loss. The teacher estimates soft-alignment labels for the unannotated data, which are then used to supervise the student network. As training progresses and the teacher's representations become more reliable, the proportion of soft-alignment labels provided to the student increases.
Figure 3: Comparison of loss and MAP over epochs on the AVE dataset between our method and AADML, under the triplet and hard-triplet selection strategies.
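
The teacher-to-student step in the Figure 2 caption, where the teacher estimates soft-alignment labels for unannotated data, can be sketched as follows. This is one plausible reading, not the paper's formulation: using cosine similarity, a row-wise softmax, and the `temperature` parameter are all assumptions, as are the function names.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)


def soft_alignment(audio_emb, visual_emb, temperature=0.1):
    """Return one probability row per audio embedding over all visual embeddings.

    Assumed form: softmax of temperature-scaled cosine similarities computed
    from teacher embeddings. The paper may define soft alignments differently.
    """
    rows = []
    for a in audio_emb:
        logits = [cosine(a, v) / temperature for v in visual_emb]
        peak = max(logits)  # subtract the max for numerical stability
        exps = [math.exp(x - peak) for x in logits]
        z = sum(exps)
        rows.append([e / z for e in exps])
    return rows


if __name__ == "__main__":
    # Toy teacher embeddings for two unannotated audio-visual pairs.
    audio = [[1.0, 0.0], [0.2, 1.0]]
    visual = [[0.9, 0.1], [0.0, 1.0]]
    for row in soft_alignment(audio, visual):
        print([round(p, 3) for p in row])
```

Each row is a distribution over candidate visual instances, so a student could be supervised with graded targets instead of hard positive and negative assignments.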
