Exploring Robust Face-Voice Matching in Multilingual Environments

Jiehui Tang, Xiaofei Wang, Zhen Xiao, Jiayi Liu, Xueliang Liu, Richang Hong

TL;DR

This work tackles cross-modal face-voice verification in multilingual settings by extending Fusion and Orthogonal Projection (FOP) with four components: a dual-branch architecture, dynamic sample-pair weighting, targeted data augmentation, and a score polarization strategy based on age and gender cues. The dual-branch design pairs a frozen FOP feature extractor with a trainable updater and fuses their outputs through an attention-guided ConvLayer, improving robustness across languages. Dynamic weighting emphasizes hard positive and negative pairs through a similarity-aware loss, while augmentation disrupts the original face-voice pairings to broaden the training scenarios. Age- and gender-informed score adjustments then refine the final decisions. Together, these components yield a top-3 placement on the MAV-Celeb FAME2024 benchmark, with EERs of 20.07 on V2-EH and 21.76 on V1-EU. The approach is relevant to multilingual biometric verification and offers a framework for integrating multimodal signals under diverse linguistic conditions.
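
The summary does not spell out how score polarization is computed, so the following is only a minimal sketch of the general idea: push a match score toward an extreme when an auxiliary attribute check is confident. The function name `polarize_score`, the `attr_agreement` input, the thresholds `low` and `high`, and the `strength` factor are all hypothetical choices for illustration and are not taken from the paper.

```python
def polarize_score(score, attr_agreement, low=0.2, high=0.8, strength=0.5):
    """Push a face-voice match score toward 0 or 1 (hypothetical rule).

    score:          base match score in [0, 1], higher = more likely same identity.
    attr_agreement: assumed confidence in [0, 1] that the age and gender
                    estimated from the face agree with those from the voice.
    low, high:      assumed confidence thresholds, not values from the paper.
    strength:       fraction of the remaining distance to the extreme to cover.
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must lie in [0, 1]")
    if attr_agreement >= high:
        # Attributes agree confidently: move the score toward 1.
        return score + strength * (1.0 - score)
    if attr_agreement <= low:
        # Attributes disagree confidently: move the score toward 0.
        return score - strength * score
    # Ambiguous attribute evidence: leave the score unchanged.
    return score


if __name__ == "__main__":
    for agreement in (0.9, 0.5, 0.1):
        print(agreement, round(polarize_score(0.6, agreement), 3))
```

Under these assumptions, only pairs with confident attribute evidence are moved, so ambiguous cases keep the score produced by the matching model itself.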

Abstract

This paper presents Team Xaiofei's approach to the Face-Voice Association in Multilingual Environments (FAME) challenge at ACM Multimedia 2024. We focus on the impact of different languages on face-voice matching by building upon Fusion and Orthogonal Projection (FOP) and introducing four key components: a dual-branch structure, dynamic sample pair weighting, robust data augmentation, and a score polarization strategy. The dual-branch structure serves as an auxiliary mechanism that integrates the two branches and provides more comprehensive information. We also introduce a dynamic weighting mechanism for the various sample pairs to optimize learning. Data augmentation techniques are employed to enhance the model's generalization across diverse conditions. Additionally, a score polarization strategy based on age and gender matching confidence clarifies and accentuates the final results. Our methods achieve an equal error rate (EER) of 20.07 on the V2-EH dataset and 21.76 on the V1-EU dataset.
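
For readers unfamiliar with the metric, the equal error rate is the operating point at which the false acceptance rate (non-matching pairs accepted) equals the false rejection rate (matching pairs rejected). The sketch below is a generic threshold-sweep approximation, not the challenge's official scoring script. It assumes similarity scores where higher means "more likely the same identity", and the function name `equal_error_rate` is illustrative.

```python
def equal_error_rate(scores, labels):
    """Approximate the EER by sweeping every observed score as a threshold.

    scores: similarity scores, higher = more likely a matching pair (assumed).
    labels: truthy for matching face-voice pairs, falsy for non-matching ones.
    Returns the EER as a fraction in [0, 1].
    """
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        raise ValueError("need at least one matching and one non-matching pair")

    best_gap, eer = None, None
    for t in sorted(set(scores)):
        far = sum(s >= t for s in neg) / len(neg)  # non-matches accepted
        frr = sum(s < t for s in pos) / len(pos)   # matches rejected
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            # Keep the threshold where the two error rates are closest.
            best_gap, eer = gap, (far + frr) / 2
    return eer


if __name__ == "__main__":
    scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
    labels = [1, 1, 1, 0, 1, 0, 0, 0]
    print(equal_error_rate(scores, labels))
```

The function returns a fraction; multiply by 100 to express it as a percentage. If a system outputs distances instead of similarities, negate them before calling the function.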

Paper Structure (16 sections, 13 equations, 2 figures, 5 tables)

This paper contains 16 sections, 13 equations, 2 figures, 5 tables.

Figures (2)

  • Figure 1: Illustration of the dataset and cross-modal verification process. The training set contains audio and facial features in English (Urdu). The testing set includes samples in both English and Urdu. Cross-modal verification is performed using multiple languages, comparing facial features with audio features from different languages.
  • Figure 2: The primary methodology employed by our team in the challenge: (a) the overarching architecture of the dual-branch model, (b) our ConvLayer for fusion, and (c) the dynamic weight configuration.