Adversarial Deep Metric Learning for Cross-Modal Audio-Text Alignment in Open-Vocabulary Keyword Spotting
Youngmoon Jung, Yong-Hyeok Lee, Myunghun Jung, Jaeyoung Roh, Chang Woo Han, Hoon-Young Cho
TL;DR
The paper tackles open-vocabulary keyword spotting (KWS) in which keywords are enrolled via text, and addresses the heterogeneity between audio and text representations. It introduces Adversarial Deep Metric Learning (ADML), which combines Modality Adversarial Learning (MAL) with deep metric learning to align phoneme- and utterance-level embeddings in a shared space, aided by cross-attention phoneme alignment and an AdaMS-enhanced AsyP loss. A SphereFace2 (SF2)-based keyword classification loss improves intra-modal discrimination, and a gradient reversal layer lets the encoders be trained adversarially against a modality classifier to reduce the modality gap. Experiments on WSJ and LibriPhrase show consistent gains over baselines, including significant improvements when MAL is applied at both the phoneme and utterance levels and when SF2 is used for keyword discrimination.
Abstract
For text enrollment-based open-vocabulary keyword spotting (KWS), acoustic and text embeddings are typically compared at either the phoneme or utterance level. To facilitate this, we optimize acoustic and text encoders using deep metric learning (DML), enabling direct comparison of multi-modal embeddings in a shared embedding space. However, the inherent heterogeneity between audio and text modalities presents a significant challenge. To address this, we propose Modality Adversarial Learning (MAL), which reduces the domain gap in heterogeneous modality representations. Specifically, we train a modality classifier adversarially to encourage both encoders to generate modality-invariant embeddings. Additionally, we apply DML to achieve phoneme-level alignment between audio and text, and conduct extensive comparisons across various DML objectives. Experiments on the Wall Street Journal (WSJ) and LibriPhrase datasets demonstrate the effectiveness of the proposed approach.
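The adversarial training of the modality classifier can be illustrated with a small sketch of gradient reversal. This is a simplified illustration, not the authors' implementation. It assumes a linear binary modality classifier (0 = audio, 1 = text) on a single embedding vector, and it updates the embedding directly, where a real system would backpropagate into the encoder parameters. All function names, values, and the reversal weight `lam` are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def modality_loss_and_grads(embedding, weights, bias, label):
    """Binary cross-entropy of a linear modality classifier.

    label: 0 = audio, 1 = text (an assumed convention).
    Returns the loss and its gradients w.r.t. the classifier
    weights and w.r.t. the input embedding.
    """
    logit = sum(w * e for w, e in zip(weights, embedding)) + bias
    p = sigmoid(logit)
    loss = -(label * math.log(p) + (1 - label) * math.log(1.0 - p))
    dlogit = p - label  # d(loss)/d(logit) for sigmoid + cross-entropy
    grad_weights = [dlogit * e for e in embedding]
    grad_embedding = [dlogit * w for w in weights]
    return loss, grad_weights, grad_embedding


def gradient_reversal(grad, lam=1.0):
    """Backward pass of a gradient reversal layer: identity in the
    forward pass, gradient scaled by -lam in the backward pass."""
    return [-lam * g for g in grad]


# Hypothetical toy values: one audio embedding and a small classifier.
embedding = [0.5, -1.0, 0.25]
weights = [0.8, -0.3, 0.1]
bias = 0.0
label = 0  # audio
lr = 0.1

loss_before, grad_w, grad_e = modality_loss_and_grads(
    embedding, weights, bias, label
)

# Classifier side: ordinary gradient descent, so it gets better at
# telling the two modalities apart.
new_weights = [w - lr * g for w, g in zip(weights, grad_w)]
loss_classifier_step, _, _ = modality_loss_and_grads(
    embedding, new_weights, bias, label
)

# Encoder side: the gradient is reversed before the descent step, so
# the embedding moves in the direction that confuses the classifier.
reversed_grad = gradient_reversal(grad_e, lam=1.0)
new_embedding = [e - lr * g for e, g in zip(embedding, reversed_grad)]
loss_encoder_step, _, _ = modality_loss_and_grads(
    new_embedding, weights, bias, label
)
```

In this toy setting the classifier step lowers the modality classification loss, while the reversed-gradient step on the embedding raises it. That opposition is what pushes the encoders toward modality-invariant embeddings.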
