Towards Language-Independent Face-Voice Association with Multimodal Foundation Models
Aref Farhadipour, Teodora Vukovic, Volker Dellwo
TL;DR
The paper tackles cross-modal face–voice verification under multilingual and unseen-language settings. It compares a CLIP-style dual-encoder trained from scratch with an ImageBind-based model adapted via LoRA, and uses an externally curated Arabic dataset drawn from VoxBlink to encourage language-agnostic identity representations. Empirical results show that ImageBind-LoRA generalizes across English, German, and Urdu, reaching an EER of 24.73% on the English and German evaluation set and placing second in the FAME 2026 Challenge. The findings suggest that parameter-efficient fine-tuning of a multimodal foundation model can outperform a dual-encoder trained from scratch when multilingual training data is limited.
Abstract
This paper describes the UZH-CL system submitted to the FAME 2026 Challenge. The challenge focuses on cross-modal face-voice verification under multilingual conditions, specifically unseen and unheard languages. We investigate two architectures: a baseline dual-encoder system trained from scratch with contrastive and orthogonal projection losses, and a foundation-model approach that adapts ImageBind with LoRA. To address the data scarcity and language constraints of the challenge, we curated an external Arabic dataset from VoxBlink. Our best-performing system, ImageBind-LoRA, demonstrates strong cross-lingual generalization: despite being fine-tuned exclusively on Arabic audio, it achieves an EER of 24.73% on the evaluation set (English and German), securing 2nd place in the competition.
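For readers unfamiliar with the metric reported above, the following is a minimal sketch of how an equal error rate (EER) can be computed from verification scores. It is not the challenge's official scoring script. The function name `equal_error_rate` and the toy scores are illustrative assumptions, and the scores are assumed to be similarities (for example, cosine similarity between a face embedding and a voice embedding) where higher means "more likely the same identity".

```python
import numpy as np


def equal_error_rate(scores, labels):
    """Return (eer, threshold) for similarity scores.

    labels: 1 for a matching face-voice pair, 0 for a non-matching pair.
    A pair is accepted when its score is >= the threshold.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    pos = scores[labels == 1]  # genuine (same-identity) pairs
    neg = scores[labels == 0]  # impostor (different-identity) pairs
    if pos.size == 0 or neg.size == 0:
        raise ValueError("need at least one positive and one negative pair")

    # Sweep every observed score as a candidate decision threshold.
    thresholds = np.unique(scores)
    # False acceptance rate: impostor pairs that are accepted.
    far = np.array([(neg >= t).mean() for t in thresholds])
    # False rejection rate: genuine pairs that are rejected.
    frr = np.array([(pos < t).mean() for t in thresholds])

    # The EER is the operating point where FAR and FRR are closest;
    # on finite data they rarely coincide exactly, so average the two.
    idx = int(np.argmin(np.abs(far - frr)))
    eer = float((far[idx] + frr[idx]) / 2.0)
    return eer, float(thresholds[idx])


if __name__ == "__main__":
    # Toy example (made-up scores, not results from the paper).
    toy_scores = [0.9, 0.4, 0.6, 0.1]
    toy_labels = [1, 1, 0, 0]
    eer, thr = equal_error_rate(toy_scores, toy_labels)
    print(f"EER = {eer:.2%} at threshold {thr:.2f}")
```

Under this definition, an EER of 24.73% means that at the threshold where the two error types balance, roughly one in four matching pairs is rejected and roughly one in four non-matching pairs is accepted.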
