Exploring Robust Face-Voice Matching in Multilingual Environments
Jiehui Tang, Xiaofei Wang, Zhen Xiao, Jiayi Liu, Xueliang Liu, Richang Hong
TL;DR
This work tackles cross-modal face-voice verification in multilingual settings by extending Fusion and Orthogonal Projection (FOP) with a dual-branch architecture, dynamic sample-pair weighting, targeted data augmentation, and a score polarization strategy based on age and gender cues. The dual-branch design pairs a frozen FOP feature extractor with a trainable updater and fuses their outputs through an attention-guided ConvLayer, improving robustness across languages. Dynamic weighting uses a similarity-aware loss to emphasize hard positive and hard negative pairs, while augmentation breaks up the original face-voice pairings to broaden the set of training scenarios. Age- and gender-informed score adjustments then refine the final decisions. Together, these components reach EERs of $20.07$ on V2-EH and $21.76$ on V1-EU and a top-3 placement on the MAV-Celeb FAME2024 benchmark. The approach is directly relevant to multilingual biometric verification and offers a framework for integrating multimodal signals under diverse linguistic conditions.
Abstract
This paper presents Team Xaiofei's approach to exploring Face-Voice Association in Multilingual Environments (FAME) at ACM Multimedia 2024. We focus on the impact of different languages on face-voice matching and build upon Fusion and Orthogonal Projection (FOP), introducing four key components: a dual-branch structure, dynamic sample-pair weighting, robust data augmentation, and a score polarization strategy. The dual-branch structure serves as an auxiliary mechanism that integrates features more effectively and provides more comprehensive information. The dynamic weighting mechanism assigns different weights to different sample pairs to optimize learning. Data augmentation is employed to improve the model's generalization across diverse conditions. Finally, the score polarization strategy, based on age and gender matching confidence, sharpens the separation between matching and non-matching scores in the final results. Our methods prove effective, achieving an equal error rate (EER) of 20.07 on the V2-EH dataset and 21.76 on the V1-EU dataset.
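Since the results above are reported as equal error rates, a minimal sketch of how an EER can be computed from verification scores may help in reading them. This is a generic illustration, not the challenge's official scoring script: the function name `equal_error_rate`, the convention that a higher score means "same identity", and the toy scores are all assumptions made here. The EER is the operating point at which the false acceptance rate and the false rejection rate coincide.

```python
def equal_error_rate(scores, labels):
    """Return (eer, threshold) for a list of pair scores.

    Assumptions (not taken from the paper): a higher score means the face
    and voice are more likely to share an identity; labels are 1 for a
    matching pair and 0 for a non-matching pair.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one matching and one non-matching pair")

    best = None
    for t in sorted(set(scores)):
        # A pair is accepted as a match when its score is >= t.
        far = sum(s >= t for s in negatives) / len(negatives)  # false acceptance rate
        frr = sum(s < t for s in positives) / len(positives)   # false rejection rate
        gap = abs(far - frr)
        # Keep the threshold where the two error rates are closest.
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2, t)
    return best[1], best[2]


# Toy example with hypothetical scores for eight face-voice pairs.
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
toy_labels = [1, 1, 0, 1, 0, 1, 0, 0]
eer, threshold = equal_error_rate(toy_scores, toy_labels)
print(f"EER = {eer:.2f} at threshold {threshold}")
```

The sketch returns the EER as a fraction in the range 0 to 1; multiply by 100 if a percentage is wanted. If the system produces distances rather than similarities, so that lower means a better match, negate the scores before calling the function.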
