Shared Multi-modal Embedding Space for Face-Voice Association

Christopher Simic, Korbinian Riedhammer, Tobias Bocklet

TL;DR

The paper tackles face–voice association in multilingual settings and unseen languages by training modality-specific audio and image pipelines that project into a shared embedding space using Adaptive Angular Margin loss. It leverages VoxCeleb2, CommonVoice, and MavCeleb data with heard/unheard language splits and cross-validation, and compares modality-separated embeddings to a fully fused cross-attention approach. The best-performing strategy combines pretraining on large datasets with targeted fine-tuning on MavCeleb for unheard scenarios, achieving an average EER of 23.99% and demonstrating the robustness of language-diverse, cross-modal embeddings. The work highlights the importance of domain adaptation and structured embedding learning for scalable face–voice verification across languages.

Abstract

The FAME 2026 challenge combines two demanding tasks: learning face-voice associations, and doing so in a multilingual setting that includes testing on languages the model was not trained on. Our approach consists of separate uni-modal processing pipelines with general face and voice feature extraction, complemented by additional age-gender feature extraction to support prediction. The resulting single-modal features are projected into a shared embedding space and trained with an Adaptive Angular Margin (AAM) loss. Our approach achieved first place in the FAME 2026 challenge, with an average Equal-Error Rate (EER) of 23.99%.
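
Since the headline result is reported as an Equal-Error Rate, a small sketch of how an EER can be computed from verification scores may help. It assumes each face-voice trial has a similarity score (higher means "same identity") and a binary label; the function name `equal_error_rate` and the threshold sweep are illustrative assumptions, and the challenge's official scoring script may differ in detail.

```python
import numpy as np

def equal_error_rate(scores, labels):
    """Approximate EER from trial scores (hypothetical helper, not the official scorer).

    scores: similarity per trial, higher = more likely the same identity.
    labels: 1/True for matching face-voice pairs, 0/False for non-matching.
    Both classes must be present. Returns a fraction in [0, 1].
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    # Candidate thresholds: every observed score, plus +inf so that
    # "accept nothing" is also considered.
    thresholds = np.concatenate([np.unique(scores), [np.inf]])
    # False acceptance rate: non-matching trials accepted at threshold t.
    far = np.array([(scores[~labels] >= t).mean() for t in thresholds])
    # False rejection rate: matching trials rejected at threshold t.
    frr = np.array([(scores[labels] < t).mean() for t in thresholds])
    # The EER is the operating point where both error rates are (nearly) equal.
    idx = np.argmin(np.abs(far - frr))
    return (far[idx] + frr[idx]) / 2.0
```

The function returns a fraction, so a value of 0.2399 would correspond to the 23.99% quoted above.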

Paper Structure

This paper contains 13 sections, 1 figure, 1 table.

Figures (1)

  • Figure 1: Overall model architecture, comprising separate modality-specific processing pipelines, dedicated mapping layers and a shared classifier. Both pipelines are optimized using the AAM loss.
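
To make the "shared classifier" and angular-margin idea in the caption concrete, here is a minimal NumPy sketch. The summary gives no formula for the loss, so this assumes the commonly used additive angular margin softmax form (margin added to the target-class angle, then scaling); the authors' AAM variant, margin, and scale may differ. The names `aam_softmax_loss`, `margin`, and `scale`, and the random demo data, are hypothetical.

```python
import numpy as np

def aam_softmax_loss(embeddings, class_weights, labels, margin=0.2, scale=30.0):
    """Angular-margin softmax loss (simplified sketch, assumed form).

    embeddings:    (batch, dim) outputs of a modality-specific mapping layer.
    class_weights: (num_ids, dim) weights of the classifier shared by both modalities.
    labels:        (batch,) integer identity labels.
    """
    labels = np.asarray(labels, dtype=int)
    # L2-normalize so that dot products are cosines of the angles between vectors.
    emb = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    w = class_weights / np.linalg.norm(class_weights, axis=1, keepdims=True)
    cos = np.clip(emb @ w.T, -1.0, 1.0)
    rows = np.arange(len(labels))
    logits = cos.copy()
    # Add the margin to the target-class angle only. Simplification: no special
    # handling of the case theta + margin > pi.
    logits[rows, labels] = np.cos(np.arccos(cos[rows, labels]) + margin)
    logits *= scale
    # Numerically stable log-softmax followed by cross-entropy on the target class.
    logits -= logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[rows, labels].mean()

# Demo with random stand-in data: both modalities are scored against the SAME weights.
rng = np.random.default_rng(0)
shared_weights = rng.normal(size=(5, 16))   # 5 identities, 16-dim shared space
face_emb = rng.normal(size=(8, 16))         # stand-in for mapped face features
voice_emb = rng.normal(size=(8, 16))        # stand-in for mapped voice features
ids = rng.integers(0, 5, size=8)
total_loss = (aam_softmax_loss(face_emb, shared_weights, ids)
              + aam_softmax_loss(voice_emb, shared_weights, ids))
```

Because both losses use the same class weights, face and voice embeddings of one identity are pulled toward the same class direction, which is one plausible reading of how a shared classifier yields a shared embedding space.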