Metric Learning with Progressive Self-Distillation for Audio-Visual Embedding Learning
Donghuo Zeng, Kazushi Ikeda
TL;DR
The paper addresses a limitation of label-guided metric learning for audio-visual embeddings: it underuses the latent cross-modal distributions in the data. The proposed remedy is progressive self-distillation. It introduces a soft cross-modal triplet framework guided by soft alignment labels generated by a teacher network, with batch partitioning and a gradually shrinking annotation ratio $r$ that moves training from label-guided toward self-supervised learning. The method combines a labeled loss $l_{lab}$, a cross-modal triplet loss $l_{cross}$, and a cross-modal decorrelation loss $l_{dis}$ to learn a robust shared embedding space, and it improves audio-visual cross-modal retrieval on the AVE and VEGAS datasets. Experiments show consistent MAP gains over state-of-the-art baselines, supporting the effectiveness of soft alignments and progressive self-distillation for multimodal representation learning, with practical implications for scalable cross-modal retrieval systems.
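To make the ideas in this summary concrete, the following is a minimal NumPy sketch, not the authors' implementation. Everything specific in it is an assumption made for illustration: the linear schedule for $r$, the first-rows-are-labeled batch split, the softmax-over-cosine-similarity form of the soft alignment labels, the temperature, the margin, and the particular way the triplet hinge is weighted by soft labels. The function names are hypothetical, and the losses $l_{lab}$ and $l_{dis}$ are not covered.

```python
import numpy as np


def annotation_ratio(epoch, num_epochs, r_start=1.0, r_end=0.0):
    """Shrink the labeled fraction r over training (assumed linear schedule)."""
    t = epoch / max(num_epochs - 1, 1)
    return r_start + (r_end - r_start) * t


def partition_batch(batch_size, r):
    """Split batch indices into a labeled part and an unlabeled remainder.

    Assumption: the first round(r * batch_size) samples keep their labels.
    """
    n_lab = int(round(r * batch_size))
    idx = np.arange(batch_size)
    return idx[:n_lab], idx[n_lab:]


def soft_alignment(teacher_audio, teacher_visual, temperature=0.1):
    """Soft audio-to-visual alignment labels from teacher embeddings.

    Assumed form: row-wise softmax over cosine similarities.
    """
    a = teacher_audio / np.linalg.norm(teacher_audio, axis=1, keepdims=True)
    v = teacher_visual / np.linalg.norm(teacher_visual, axis=1, keepdims=True)
    logits = a @ v.T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    p = np.exp(logits)
    return p / p.sum(axis=1, keepdims=True)


def soft_cross_triplet(audio, visual, soft_labels, margin=0.2, eps=1e-8):
    """Triplet-style hinge with soft positives and negatives (assumed form).

    For each audio anchor i, the positive distance is the soft-label-weighted
    sum of squared distances to all visual samples, and the negative distance
    is the average weighted by (1 - soft label).
    """
    diff = audio[:, None, :] - visual[None, :, :]
    dist = (diff ** 2).sum(axis=2)  # (B, B) squared Euclidean distances
    pos = (soft_labels * dist).sum(axis=1)
    neg_w = 1.0 - soft_labels
    neg = (neg_w * dist).sum(axis=1) / (neg_w.sum(axis=1) + eps)
    return float(np.maximum(0.0, pos - neg + margin).mean())


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    audio = rng.normal(size=(8, 16))   # stand-in student audio embeddings
    visual = rng.normal(size=(8, 16))  # stand-in student visual embeddings
    for epoch in range(3):
        r = annotation_ratio(epoch, num_epochs=3)
        labeled, unlabeled = partition_batch(len(audio), r)
        # Here the student embeddings double as the teacher's, for brevity.
        p = soft_alignment(audio, visual)
        loss = soft_cross_triplet(audio, visual, p)
        print(epoch, r, len(labeled), len(unlabeled), loss)
```

In this sketch the labeled part of each batch would feed a label-guided loss, while the soft labels supervise the remainder; as $r$ shrinks, more of each batch relies on the teacher's soft alignments.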
Abstract
Metric learning projects samples into an embedded space, where similarities and dissimilarities are quantified based on their learned representations. However, existing methods often rely on label-guided representation learning, where representations of different modalities, such as audio and visual data, are aligned based on annotated labels. This approach tends to underutilize latent complex features and potential relationships inherent in the distributions of audio and visual data that are not directly tied to the labels, resulting in suboptimal performance in audio-visual embedding learning. To address this issue, we propose a novel architecture that integrates cross-modal triplet loss with progressive self-distillation. Our method enhances representation learning by leveraging inherent distributions and dynamically refining soft audio-visual alignments -- probabilistic alignments between audio and visual data that capture the inherent relationships beyond explicit labels. Specifically, the model distills audio-visual distribution-based knowledge from annotated labels in a subset of each batch. This self-distilled knowledge is used t
