RFOP: Rethinking Fusion and Orthogonal Projection for Face-Voice Association

Abdul Hannan, Furqan Malik, Hina Jabbar, Syed Suleman Sadiq, Mubashir Noman

TL;DR

The paper addresses cross-modal face-voice verification in multilingual settings, particularly English-German, where prior fusion and alignment strategies struggle with cross-language semantics. It presents RFOP, a framework that projects face and voice embeddings into a shared latent space via linear layers, applies attention-based fusion to highlight salient information, and trains the model with a three-term loss, $L_{total}=\alpha_1 L_{MSE} + \alpha_2 L_{OP} + \alpha_3 L_{CE}$. Key contributions include a two-stage latent-space projection, an attention-weighted fusion module, and empirical results on FAME26 V3 showing a competitive EER and a 3rd-place ranking. The findings shed light on language-transfer dynamics in multilingual face-voice matching and provide guidance on fusion design under noisy cross-language data.
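The projection and fusion steps described above can be illustrated with a small sketch. This is not the authors' implementation: the summary does not specify how the attention weights are computed, so the scoring scheme here (one hypothetical learned `score_vector`, a dot product per modality, and a softmax over the two scores) is an assumption, as are all dimensions and the random stand-in features.

```python
import math
import random


def linear(x, weight, bias):
    """Apply y = W x + b, with weight given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]


def softmax(values):
    """Numerically stable softmax over a short list."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def attention_fuse(x_f, x_v, score_vector):
    """Fuse two same-sized embeddings with per-modality attention weights.

    score_vector is a hypothetical learned vector (an assumption, not from the
    paper): each modality's score is its dot product with the embedding, and a
    softmax turns the two scores into weights that sum to 1.
    """
    scores = [sum(s * x for s, x in zip(score_vector, emb)) for emb in (x_f, x_v)]
    w_f, w_v = softmax(scores)
    fused = [w_f * a + w_v * b for a, b in zip(x_f, x_v)]
    return fused, (w_f, w_v)


rng = random.Random(0)


def random_matrix(rows, cols):
    """Random stand-in for trained projection weights."""
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


# Stand-ins for unimodal encoder outputs (dimensions chosen arbitrarily).
face_feature = [rng.uniform(-1, 1) for _ in range(6)]
voice_feature = [rng.uniform(-1, 1) for _ in range(4)]

# Linear projections into a shared 3-dimensional space give X_f and X_v.
x_f = linear(face_feature, random_matrix(3, 6), [0.0] * 3)
x_v = linear(voice_feature, random_matrix(3, 4), [0.0] * 3)

fused, weights = attention_fuse(x_f, x_v, [rng.uniform(-1, 1) for _ in range(3)])
```

Because the weights come from a softmax, the fused embedding is always a convex combination of $X_f$ and $X_v$; other attention designs (for example, per-dimension gates) would be equally consistent with the summary's wording.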

Abstract

The Face-Voice Association in Multilingual Environments (FAME) 2026 challenge investigates the face-voice association task in multilingual scenarios. The challenge introduces English-German face-voice pairs for use in the evaluation phase. To this end, we revisit fusion and orthogonal projection for face-voice association, focusing on the relevant semantic information within the two modalities. Our method performs favorably on the English-German data split and ranked 3rd in the FAME 2026 challenge, achieving an EER of 33.1.
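For readers unfamiliar with the metric, the equal error rate (EER) is the operating point at which the false acceptance rate equals the false rejection rate. The sketch below is a generic threshold sweep, not the challenge's official scoring script; the function name, the toy scores, and the labels are illustrative assumptions. It returns a fraction in [0, 1], whereas the 33.1 reported above is presumably a percentage.

```python
def equal_error_rate(scores, labels):
    """Approximate EER for similarity scores (higher = more likely a match).

    labels: 1 for a genuine face-voice pair, 0 for an impostor pair.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one genuine and one impostor pair")

    best_gap, eer = float("inf"), None
    # Sweep every observed score as a decision threshold.
    for threshold in sorted(set(scores)):
        # False accepts: impostor pairs scored at or above the threshold.
        false_accepts = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= threshold)
        # False rejects: genuine pairs scored below the threshold.
        false_rejects = sum(1 for s, y in zip(scores, labels) if y == 1 and s < threshold)
        far = false_accepts / negatives
        frr = false_rejects / positives
        # Keep the threshold where the two error rates are closest.
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return eer


# Toy example: four genuine and four impostor pairs.
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
toy_labels = [1, 1, 0, 1, 0, 1, 0, 0]
print(equal_error_rate(toy_scores, toy_labels))  # 0.25
```

In the toy example, a threshold of 0.6 accepts one of four impostor pairs and rejects one of four genuine pairs, so both error rates are 0.25.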

Paper Structure

This paper contains 3 sections, 1 equation, 1 figure, 1 table.

Figures (1)

  • Figure 1: Overall face-voice association network. Face and voice features are extracted by the corresponding unimodal encoders and passed through linear projection layers to obtain embeddings $X_f$ and $X_v$, which the fusion module combines into fused embeddings. Finally, the network is optimized with a combination of three losses: $L_{total}=\alpha_1 L_{MSE} + \alpha_2 L_{OP} + \alpha_3 L_{CE}$.
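The weighted objective in the caption can be sketched as follows. The excerpt does not define the individual terms or give values for $\alpha_1$, $\alpha_2$, $\alpha_3$, so everything beyond the weighted sum is an assumption: the default weights are placeholders, and `orthogonality_penalty` uses one common form of an orthogonal projection loss (pull same-identity embeddings toward cosine similarity 1, push different-identity embeddings toward 0), which may differ from the paper's $L_{OP}$.

```python
import math


def cosine(a, b):
    """Cosine similarity between two nonzero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def orthogonality_penalty(embeddings, identities):
    """Assumed L_OP: (1 - mean same-identity cosine) + |mean cross-identity cosine|."""
    same, diff = [], []
    for i in range(len(embeddings)):
        for j in range(i + 1, len(embeddings)):
            sim = cosine(embeddings[i], embeddings[j])
            (same if identities[i] == identities[j] else diff).append(sim)
    # With no pairs of a given kind, that part of the penalty contributes zero.
    mean_same = sum(same) / len(same) if same else 1.0
    mean_diff = sum(diff) / len(diff) if diff else 0.0
    return (1.0 - mean_same) + abs(mean_diff)


def total_loss(l_mse, l_op, l_ce, alphas=(1.0, 1.0, 1.0)):
    """L_total = alpha_1 * L_MSE + alpha_2 * L_OP + alpha_3 * L_CE.

    The default alphas are placeholders, not values from the paper.
    """
    a1, a2, a3 = alphas
    return a1 * l_mse + a2 * l_op + a3 * l_ce


# Toy fused embeddings: two samples of identity 0, one of identity 1.
toy_embeddings = [[1.0, 0.0], [2.0, 0.0], [0.0, 1.0]]
toy_identities = [0, 0, 1]
l_op = orthogonality_penalty(toy_embeddings, toy_identities)

# l_mse and l_ce are made-up scalars standing in for the other two terms.
loss = total_loss(l_mse=0.5, l_op=l_op, l_ce=2.0, alphas=(1.0, 2.0, 0.5))
```

In the toy batch, the same-identity pair is perfectly aligned and the cross-identity pairs are orthogonal, so the assumed penalty is zero and only the other two terms contribute to the total.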