Face-voice Association in Multilingual Environments (FAME) Challenge 2024 Evaluation Plan

Muhammad Saad Saeed, Shah Nawaz, Muhammad Salman Tahir, Rohan Kumar Das, Muhammad Zaigham Zaheer, Marta Moscati, Markus Schedl, Muhammad Haris Khan, Karthik Nandakumar, Muhammad Haroon Yousaf

TL;DR

The paper addresses face-voice association under multilingual conditions by leveraging the MAV-Celeb dataset and a cross-modal verification framework. It introduces the FAME 2024 challenge objectives, a two-stream embedding baseline with fusion and orthogonal projection for joint face-voice representation, and an unheard/unseen evaluation protocol to assess language effects. Key contributions include dataset-driven multilingual analysis, a practical baseline, and an established evaluation workflow with EER as the metric, enabling comparison across heard, unheard, and completely unseen languages. The work holds practical significance for real-world multilingual biometric systems and multimodal verification under language variability, and it defines a public, reproducible benchmark and submission process. The overall score is computed as $ \text{Overall Score} = \frac{\sum \text{EER}}{4}$, reflecting aggregated recognition performance across configurations.
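The scoring described above can be illustrated with a minimal sketch. It assumes that each trial has a similarity score where higher means "more likely the same identity" and a binary label (1 for a matching face-voice pair, 0 otherwise); a system that outputs distances would need the comparison flipped. The function names `equal_error_rate` and `overall_score` are hypothetical, and the threshold sweep is a simple approximation of EER, not the challenge's official scoring script.

```python
import numpy as np


def equal_error_rate(scores, labels):
    """Approximate EER by sweeping every observed score as a threshold.

    Assumption: higher score = more likely a matching face-voice pair;
    labels are 1 (match) or 0 (non-match).
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    pos = labels == 1
    neg = ~pos
    thresholds = np.unique(scores)
    # False acceptance: non-matching pairs scored at or above the threshold.
    far = np.array([(scores[neg] >= t).mean() for t in thresholds])
    # False rejection: matching pairs scored below the threshold.
    frr = np.array([(scores[pos] < t).mean() for t in thresholds])
    idx = np.argmin(np.abs(far - frr))
    return (far[idx] + frr[idx]) / 2.0


def overall_score(eers):
    """Average of the four per-configuration EERs, as in the TL;DR formula."""
    eers = list(eers)
    if len(eers) != 4:
        raise ValueError("expected exactly four EER values")
    return sum(eers) / 4.0
```

Passing four per-configuration EER values to `overall_score` returns their arithmetic mean; lower is better, since EER is an error rate.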

Abstract

Advances in technology have led to the use of multimodal systems in various real-world applications, and audio-visual systems are among the most widely used. In recent years, associating the face and voice of a person has gained attention due to the presence of a unique correlation between them. The Face-voice Association in Multilingual Environments (FAME) Challenge 2024 focuses on exploring face-voice association under the unique condition of a multilingual scenario. This condition is inspired by the fact that half of the world's population is bilingual and people most often communicate in multilingual scenarios. The challenge uses the Multilingual Audio-Visual (MAV-Celeb) dataset for exploring face-voice association in multilingual environments. This report provides the details of the challenge, dataset, baselines, and tasks for the FAME Challenge.

Paper Structure (9 sections, 1 equation, 4 figures, 3 tables)

Figures (4)

  • Figure 1: (Left) Standard face-voice association is established with a cross-modal verification task. (Right) The FAME Challenge 2024 extends the verification task to analyze the impact of multiple languages.
  • Figure 2: Audio-visual samples selected from the MAV-Celeb dataset. The visual data contains variations in pose, lighting condition, and motion. The (Left) block shows celebrities speaking English and the (Right) block shows the same celebrities speaking Hindi.
  • Figure 3: MAV-Celeb file structure.
  • Figure 4: Overall architecture of our baseline method. Fundamentally, it is a two-stream pipeline that generates face and voice embeddings. We propose a lightweight, plug-and-play mechanism, dubbed fusion and orthogonal projection (FOP) (shown in the dotted red box).
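
The Figure 4 caption names the two ingredients of the baseline (fusion of face and voice embeddings, and an orthogonality constraint) without giving details. The sketch below is one plausible reading, not the authors' implementation. It assumes a simple additive fusion of linearly projected embeddings and a penalty that pushes fused embeddings of the same identity toward cosine similarity 1 and those of different identities toward 0. The names `fuse` and `orthogonality_penalty`, the weight matrices, and the `tanh` nonlinearity are all hypothetical.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length (rows with near-zero norm stay near zero)."""
    return x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), eps)


def fuse(face_emb, voice_emb, w_face, w_voice):
    """Hypothetical fusion: project each stream to a shared size, then add."""
    f = np.tanh(face_emb @ w_face)
    v = np.tanh(voice_emb @ w_voice)
    return l2_normalize(f + v)


def orthogonality_penalty(fused, identities):
    """Assumed penalty: (1 - mean same-identity cosine) + |mean cross-identity cosine|."""
    identities = np.asarray(identities)
    sim = fused @ fused.T  # cosine similarities, since rows are unit length
    same = identities[:, None] == identities[None, :]
    off_diag = ~np.eye(len(identities), dtype=bool)
    same_pairs = sim[same & off_diag]
    diff_pairs = sim[~same]
    s = same_pairs.mean() if same_pairs.size else 1.0
    d = diff_pairs.mean() if diff_pairs.size else 0.0
    return (1.0 - s) + abs(d)
```

Under these assumptions, the penalty is zero when every identity occupies its own direction and all identities are mutually orthogonal, which is the geometry the name "orthogonal projection" suggests.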