Towards Language-Independent Face-Voice Association with Multimodal Foundation Models

Aref Farhadipour, Teodora Vukovic, Volker Dellwo

TL;DR

The paper tackles cross-modal face–voice verification under multilingual and unseen-language settings. It compares a CLIP-style dual-encoder with an ImageBind-based model adapted with LoRA, and uses an externally curated Arabic dataset from VoxBlink to promote language-agnostic identity representations. Empirical results show that ImageBind-LoRA generalizes well across English, German, and Urdu, reaching an EER of 24.73% on the evaluation set (English and German) and placing second in FAME 2026. The findings suggest that parameter-efficient fine-tuning of a foundation model can outperform a dual-encoder trained from scratch when multilingual data is limited.

Abstract

This paper describes the UZH-CL system submitted to the FAME2026 Challenge. The challenge focuses on cross-modal verification under unique multilingual conditions, specifically unseen and unheard languages. Our approach investigates two distinct architectures, consisting of a baseline dual-encoder system trained from scratch using contrastive and orthogonal projection losses, and a foundation model approach leveraging ImageBind with LoRA. To address the data scarcity and language constraints of the challenge, we curated an external Arabic dataset from VoxBlink. Our best-performing system, ImageBind-LoRA, demonstrates remarkable cross-lingual generalization: despite being fine-tuned exclusively on Arabic audio, it achieved an EER of 24.73% on the evaluation set (English and German), securing 2nd place in the competition.
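The equal error rate (EER) quoted above is the operating point at which the false acceptance rate and the false rejection rate coincide. As a minimal sketch of how such a figure can be computed from face–voice similarity scores, assuming higher scores mean "same identity": the function name `compute_eer` and the toy scores are hypothetical illustrations, not the challenge's official scoring script or data from the paper, and official tooling may interpolate between thresholds rather than pick the nearest one.

```python
import numpy as np


def compute_eer(scores, labels):
    """Return (eer, threshold) for similarity scores.

    scores: similarity per face-voice pair (higher = more likely same person).
    labels: 1/True for genuine (same-identity) pairs, 0/False for impostor pairs.
    Hypothetical helper: picks the observed score threshold where the false
    acceptance and false rejection rates are closest, without interpolation.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)

    best_gap, best_eer, best_threshold = None, None, None
    for t in np.unique(scores):
        accept = scores >= t
        far = np.mean(accept[~labels])   # impostor pairs wrongly accepted
        frr = np.mean(~accept[labels])   # genuine pairs wrongly rejected
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, best_eer, best_threshold = gap, (far + frr) / 2, t
    return float(best_eer), float(best_threshold)


# Toy example with made-up scores (not results from the paper).
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
toy_labels = [1, 1, 1, 0, 1, 0, 0, 0]
eer, threshold = compute_eer(toy_scores, toy_labels)
print(f"EER = {eer:.2%} at threshold {threshold}")
```

In this toy case one genuine pair scores below one impostor pair, so a quarter of each class is misclassified at the balancing threshold. An EER of 24.73% is read the same way: roughly one in four decisions is wrong at the point where both error types are equal.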

Paper Structure

This paper contains 11 sections and 2 tables.