Face-voice Association in Multilingual Environments (FAME) 2026 Challenge Evaluation Plan

Marta Moscati, Ahmed Abdullah, Muhammad Saad Saeed, Shah Nawaz, Rohan Kumar Das, Muhammad Zaigham Zaheer, Junaid Mir, Muhammad Haroon Yousaf, Khalid Malik, Markus Schedl

TL;DR

The paper addresses face-voice association in multilingual environments by introducing the FAME 2026 Challenge and the MAV-Celeb dataset, enabling cross-language verification of face-voice pairs. It adopts a baseline two-branch multimodal model with face and voice encoders and a gated fusion layer, optimized with $L_{CE}$ and $L_{OC}$ losses, evaluated via equal error rate (EER) across heard and unheard languages. Key contributions include the expanded multilingual dataset, the progress/evaluation protocol with V1-EU and V3-EG splits, and the emphasis on language transfer effects for cross-modal matching. The practical impact lies in guiding the development of robust, language-agnostic face-voice systems for real-world multilingual settings; the overall scoring aggregates EERs as $\text{Overall Score} = \frac{\sum \text{EERs}}{4}$ to benchmark participants.
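The scoring rule in the summary can be sketched in a few lines. This is a minimal illustration, not the organizers' scoring script: `equal_error_rate` and `overall_score` are hypothetical helper names, and the threshold sweep is a simple approximation of the EER. The reading that the four averaged values are the heard and unheard EERs on each of the two splits is an assumption inferred from the summary.

```python
def equal_error_rate(scores, labels):
    """Approximate EER for similarity scores (higher = more likely a match).

    labels: 1 for a matching face-voice pair, 0 for a non-matching pair.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one matching and one non-matching pair")

    best_gap, eer = None, None
    # Sweep every observed score as a decision threshold (accept if score >= t).
    for t in sorted(set(scores)) + [float("inf")]:
        false_accepts = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= t)
        false_rejects = sum(1 for s, y in zip(scores, labels) if y == 1 and s < t)
        far = false_accepts / negatives   # false acceptance rate
        frr = false_rejects / positives   # false rejection rate
        gap = abs(far - frr)
        # Keep the threshold where the two error rates are closest.
        if best_gap is None or gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer


def overall_score(eers):
    """Mean of four EERs, following Overall Score = sum(EERs) / 4."""
    if len(eers) != 4:
        raise ValueError("expected exactly four EER values")
    return sum(eers) / 4
```

Since a lower EER is better, a lower overall score indicates a stronger system. Production evaluations usually interpolate the ROC curve rather than sweeping observed scores, so values from this sketch may differ slightly from an official tool.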

Abstract

Advances in technology have led to the use of multimodal systems in various real-world applications, and audio-visual systems are among the most widely used. In recent years, associating the face and voice of a person has gained attention because of the unique correlation between them. The Face-voice Association in Multilingual Environments (FAME) 2026 Challenge focuses on exploring face-voice association under the unique condition of a multilingual scenario. This condition is inspired by the fact that half of the world's population is bilingual and people often communicate in multilingual scenarios. The challenge uses a dataset named Multilingual Audio-Visual (MAV-Celeb) for exploring face-voice association in multilingual environments. This report provides the details of the challenge, dataset, baseline models, and tasks for the FAME Challenge.

Paper Structure

This paper contains 6 sections, 1 equation, 3 figures, 2 tables.

Figures (3)

  • Figure 1: (Left) Face-voice association is established with a cross-modal verification task (Nagrani et al., 2018). (Right) The FAME 2026 Challenge extends the task to analyze the impact of multiple languages.
  • Figure 2: Audio-visual samples selected from the MAV-Celeb dataset. The visual data contains different variations such as pose, lighting condition, and motion. The left block shows data of celebrities speaking English. The right block shows data of the same celebrities speaking German.
  • Figure 3: Overall architecture of the baseline method. Face and voice embeddings are extracted by utilizing vision and audio encoders, respectively. Extracted features are then fed to linear layers to obtain the projected features of dimension $D$. Afterwards, embeddings are fused by using a gated feature fusion module. The fused features are fed to the logits layer. The model parameters are optimized by means of a linear combination of cross-entropy ($L_{CE}$) and orthogonal constraints ($L_{OC}$) losses.
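
The pipeline in the Figure 3 caption can be illustrated with a small pure-Python sketch. The caption does not specify the gate or the exact form of $L_{OC}$, so both are assumptions here: the gate is taken to be a sigmoid over the concatenated projected embeddings, and the orthogonal constraint is taken to push same-identity embeddings toward cosine similarity 1 and different-identity embeddings toward 0. All function names and the weight `alpha` are hypothetical.

```python
import math


def linear(weights, x):
    """Matrix-vector product; weights is a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def gated_fusion(face, voice, w_gate):
    """Assumed gate: z = sigmoid(W_gate [face; voice]),
    fused = z * tanh(face) + (1 - z) * tanh(voice)."""
    gate = [1.0 / (1.0 + math.exp(-g)) for g in linear(w_gate, face + voice)]
    return [z * math.tanh(f) + (1.0 - z) * math.tanh(v)
            for z, f, v in zip(gate, face, voice)]


def cross_entropy(logits, target):
    """Softmax cross-entropy for one sample, computed stably."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum - logits[target]


def cosine(a, b):
    """Cosine similarity with a small epsilon to avoid division by zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + 1e-12)


def orthogonal_constraint(embeddings, identities):
    """Assumed L_OC: same-identity pairs -> cosine 1, other pairs -> cosine 0."""
    same, diff = [], []
    for i in range(len(embeddings)):
        for j in range(i + 1, len(embeddings)):
            c = cosine(embeddings[i], embeddings[j])
            (same if identities[i] == identities[j] else diff).append(c)
    same_term = 1.0 - sum(same) / len(same) if same else 0.0
    diff_term = abs(sum(diff) / len(diff)) if diff else 0.0
    return same_term + diff_term


def total_loss(fused_batch, logits_batch, identities, alpha=1.0):
    """Linear combination L_CE + alpha * L_OC; alpha is a hypothetical weight."""
    l_ce = sum(cross_entropy(l, y)
               for l, y in zip(logits_batch, identities)) / len(identities)
    return l_ce + alpha * orthogonal_constraint(fused_batch, identities)
```

Here `face` and `voice` stand for the projected features of dimension $D$, so `w_gate` has $D$ rows of length $2D$. A real implementation would use a tensor library and learn the projection, gate, and logits weights by backpropagation.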