RFOP: Rethinking Fusion and Orthogonal Projection for Face-Voice Association
Abdul Hannan, Furqan Malik, Hina Jabbar, Syed Suleman Sadiq, Mubashir Noman
TL;DR
The paper addresses cross-modal face-voice verification in multilingual settings, particularly English-German, where prior fusion and alignment strategies struggle with cross-language semantics. It presents RFOP, a framework that projects face and voice embeddings into a shared latent space via linear layers, applies an attention-based fusion to highlight salient information, and trains the model with a three-term loss, L_total = α1 L_MSE + α2 L_OP + α3 L_CE. Key contributions include a two-stage latent-space projection, a robust attention-weighted fusion module, and empirical results on FAME26 V3 showing competitive EER and a 3rd-place ranking. The findings shed light on language-transfer dynamics in multilingual face-voice matching and provide guidance on fusion design under noisy cross-language data.
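To make the weighted objective L_total = α1 L_MSE + α2 L_OP + α3 L_CE concrete, here is a minimal NumPy sketch. The summary does not define the individual terms, so everything below is an assumption: L_MSE is taken as the mean squared error between projected face and voice embeddings, L_OP as a common orthogonal-projection formulation (pull same-identity embeddings toward cosine similarity 1, push different-identity embeddings toward 0), and L_CE as softmax cross-entropy over identity logits. The function names and the default weights are hypothetical, not the authors' implementation.

```python
import numpy as np

def mse_loss(face, voice):
    # Assumed L_MSE: mean squared error between the two projected embeddings.
    return float(np.mean((face - voice) ** 2))

def orthogonal_projection_loss(emb, labels):
    # Assumed L_OP: (1 - s) + |d|, where s is the mean cosine similarity of
    # same-identity pairs and d that of different-identity pairs.
    unit = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sim = unit @ unit.T
    same = labels[:, None] == labels[None, :]
    pos = same & ~np.eye(len(labels), dtype=bool)  # exclude self-pairs
    neg = ~same
    s = sim[pos].mean() if pos.any() else 1.0
    d = sim[neg].mean() if neg.any() else 0.0
    return float((1.0 - s) + abs(d))

def cross_entropy_loss(logits, labels):
    # Assumed L_CE: numerically stable softmax cross-entropy.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return float(-log_probs[np.arange(len(labels)), labels].mean())

def total_loss(face, voice, fused, logits, labels, alphas=(1.0, 1.0, 1.0)):
    # Weighted sum of the three terms; the alpha values here are placeholders.
    a1, a2, a3 = alphas
    return (a1 * mse_loss(face, voice)
            + a2 * orthogonal_projection_loss(fused, labels)
            + a3 * cross_entropy_loss(logits, labels))
```

In this sketch the orthogonal term is applied to the fused embedding and the MSE term to the two unimodal projections; which tensors each term actually operates on in RFOP is not stated in the summary.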
Abstract
The Face-voice Association in Multilingual Environments (FAME) 2026 challenge investigates the face-voice association task in multilingual scenarios. The challenge introduces English-German face-voice pairs for the evaluation phase. To this end, we revisit fusion and orthogonal projection for face-voice association, focusing on the relevant semantic information within the two modalities. Our method performs favorably on the English-German data split and ranked 3rd in the FAME 2026 challenge, achieving an EER of 33.1.
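For readers unfamiliar with the reported metric, the equal error rate (EER) is the operating point at which the false acceptance rate equals the false rejection rate. The sketch below is a generic threshold sweep, not the challenge's official scoring script; it assumes that higher scores indicate a more likely face-voice match, that both matching and non-matching pairs are present, and the function name is hypothetical. It returns a fraction in [0, 1], so a figure such as the 33.1 above would presumably correspond to this value multiplied by 100.

```python
import numpy as np

def equal_error_rate(scores, labels):
    """Approximate EER by sweeping every observed score as a threshold."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)  # True = matching face-voice pair
    best_gap, eer = np.inf, 1.0
    for threshold in np.unique(scores):
        accepted = scores >= threshold
        far = np.mean(accepted[~labels])   # non-matching pairs accepted
        frr = np.mean(~accepted[labels])   # matching pairs rejected
        gap = abs(far - frr)
        if gap < best_gap:
            # Report the midpoint where the two error rates are closest.
            best_gap, eer = gap, (far + frr) / 2.0
    return float(eer)
```

If a system outputs distances rather than similarities, negating the scores before calling the function keeps the same convention.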
