PAEFF: Precise Alignment and Enhanced Gated Feature Fusion for Face-Voice Association

Abdul Hannan, Muhammad Arslan Manzoor, Shah Nawaz, Muhammad Irzam Liaqat, Markus Schedl, Mubashir Noman

TL;DR

This work tackles face-voice association, a cross-modal task, by addressing the misalignment between face and voice embedding spaces and the limitations of margin-based losses. It introduces a dual-branch framework that aligns embeddings in the hyperbolic space $\mathbb{H}_2$ using a symmetric cross-entropy alignment loss $\mathcal{L}_{A}$ and then fuses the aligned features with an Enhanced Gated Feature Fusion (EGFF) module, optimized with $\mathcal{L} = \alpha_{1} \mathcal{L}_{A} + \alpha_{2} \mathcal{L}_{OP} + \alpha_{3} \mathcal{L}_{CE}$. Key contributions include precise hyperbolic-space alignment prior to fusion, a novel attention-based fusion mechanism, and extensive VoxCeleb experiments showing state-of-the-art cross-modal verification (EER and AUC) and robust cross-modal matching. The results demonstrate that aligning multimodal embeddings before fusion yields substantial performance gains, with the approach achieving an EER of $14.3\%$ (seen-heard) and $22.9\%$ (unseen-unheard), and an AUC of $93.8\%$ (seen-heard) and $84.4\%$ (unseen-unheard). These findings suggest practical impact for biometric verification and multimedia retrieval, with future directions toward multilingual face-voice tasks and speaker diarization.
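
The symmetric alignment loss and the weighted objective above can be illustrated with a minimal sketch. It assumes a precomputed similarity matrix whose diagonal holds the matched face-voice pairs, and it averages the two directions with equal weight. Both choices are assumptions for illustration: the hyperbolic-space similarity used by the paper is not reproduced here, and the function names and default weights are hypothetical.

```python
import math


def log_softmax(row):
    """Numerically stable log-softmax over a list of scores."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def symmetric_cross_entropy(sim):
    """Symmetric cross-entropy over a square similarity matrix.

    sim[i][j] is the similarity between face i and voice j;
    matched pairs are assumed to sit on the diagonal.
    """
    n = len(sim)
    # Face -> voice: each row should peak at its diagonal entry.
    f2v = -sum(log_softmax(sim[i])[i] for i in range(n)) / n
    # Voice -> face: same computation over the columns.
    cols = [[sim[i][j] for i in range(n)] for j in range(n)]
    v2f = -sum(log_softmax(cols[j])[j] for j in range(n)) / n
    # Equal weighting of both directions is an assumption.
    return 0.5 * (f2v + v2f)


def total_loss(l_a, l_op, l_ce, a1=1.0, a2=1.0, a3=1.0):
    """Weighted sum L = a1 * L_A + a2 * L_OP + a3 * L_CE (weights hypothetical)."""
    return a1 * l_a + a2 * l_op + a3 * l_ce
```

With an uninformative similarity matrix the loss equals the log of the batch size, and it shrinks as the diagonal entries dominate their rows and columns.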

Abstract

We study the task of learning associations between faces and voices, which has recently gained interest in the multimodal community. Existing methods suffer from the need for deliberately crafted negative mining procedures as well as reliance on a distance margin parameter. These issues have been addressed by learning a joint embedding space in which orthogonality constraints are applied to the fused embeddings of faces and voices. However, the embedding spaces of faces and voices possess different characteristics and need to be aligned before they are fused. To this end, we propose a method that accurately aligns the embedding spaces and fuses them with an enhanced gated fusion, thereby improving the performance of face-voice association. Extensive experiments on the VoxCeleb dataset reveal the merits of the proposed approach.
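
To make the idea of an orthogonality constraint on fused embeddings concrete, the sketch below shows one common way to express it: pull the cosine similarity of same-identity pairs toward 1 and push that of different-identity pairs toward 0. This is an illustrative formulation only. The exact form of the orthogonal projection loss is not given in this section, and the function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def orthogonality_penalty(embeddings, labels):
    """Illustrative penalty: (1 - mean same-class cosine) + |mean cross-class cosine|.

    Assumes non-zero embeddings; not necessarily the paper's exact loss.
    """
    same, diff = [], []
    n = len(embeddings)
    for i in range(n):
        for j in range(i + 1, n):
            c = cosine(embeddings[i], embeddings[j])
            (same if labels[i] == labels[j] else diff).append(c)
    # If a batch has no pairs of one kind, that term contributes nothing.
    s = sum(same) / len(same) if same else 1.0
    d = sum(diff) / len(diff) if diff else 0.0
    return (1.0 - s) + abs(d)
```

The penalty is zero when same-identity embeddings are parallel and different-identity embeddings are orthogonal, which requires neither negative mining nor a margin parameter.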

Paper Structure

This paper contains 10 sections, 2 equations, 2 figures, 4 tables.

Figures (2)

  • Figure 1: (a) Overall illustration of the proposed face-voice association approach. Face ($X_f$) and voice ($X_v$) features are extracted by vision and audio encoders, respectively. The extracted features $X_f$ and $X_v$ are then fed to linear layers to obtain projected features of dimension $D$. The projected features are transformed to hyperbolic space ($\mathbb{H}_2$) for accurate alignment of the feature representations, and the symmetric cross-entropy loss $\mathcal{L}_{A}$ is used to align them. Afterwards, the aligned features are fused by the enhanced gated feature fusion (EGFF) module. The fused features are fed to the logits layer and optimized with the orthogonal projection ($\mathcal{L}_{OP}$) and cross-entropy ($\mathcal{L}_{CE}$) losses. (b) Detailed overview of EGFF, shown on the right side. AW refers to the computation of attention weights for feature fusion.
  • Figure 2: Cross-modal matching results of the proposed model and existing SOTA methods with varying gallery size ($n_c$).
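
The fusion step described in the Figure 1 caption can be illustrated with a basic gated fusion sketch. It assumes face and voice features of the same dimension $D$, a gate computed from their concatenation by a single linear layer followed by a sigmoid, and a convex combination of the two inputs. This is a generic gated fusion, not the EGFF module itself, whose attention-weight computation is not specified in this section. The parameter names `weights` and `bias` are hypothetical.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def gated_fusion(face, voice, weights, bias):
    """Fuse two D-dimensional feature lists with a learned gate.

    weights: D rows of length 2*D (hypothetical gate parameters)
    bias:    list of length D
    """
    concat = face + voice  # concatenated input of length 2*D
    gate = [
        sigmoid(sum(w * x for w, x in zip(row, concat)) + b)
        for row, b in zip(weights, bias)
    ]
    # Per-dimension convex combination: gate near 1 favors the face feature.
    return [g * f + (1.0 - g) * v for g, f, v in zip(gate, face, voice)]
```

With zero weights and zero bias the gate is 0.5 everywhere, so the output is the plain average of the two modalities; training would move the gate toward whichever modality is more informative per dimension.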