Comparative Analysis of Modality Fusion Approaches for Audio-Visual Person Identification and Verification

Aref Farhadipour; Masoumeh Chapariniya; Teodora Vukovic; Volker Dellwo

TL;DR

The paper compares three modality fusion strategies for audio-visual person identification and verification. A one-dimensional convolutional neural network extracts x-vectors from voice, the pre-trained VGGFace2 network with transfer learning handles the face modality, and gammatonegrams serve as a speech representation processed by the pre-trained Darknet19 network.

Abstract

Multimodal learning involves integrating information from various modalities to enhance learning and comprehension. We compare three modality fusion strategies for person identification and verification using two modalities: voice and face. A one-dimensional convolutional neural network is employed for x-vector extraction from voice, while the pre-trained VGGFace2 network with transfer learning is used for the face modality. In addition, the gammatonegram is used as a speech representation in combination with the pre-trained Darknet19 network. The proposed systems are evaluated using K-fold cross-validation on the 118 speakers of the VoxCeleb2 test set. The single-modality systems and the three proposed multimodal strategies are compared under identical conditions. Results demonstrate that feature fusion of gammatonegram and facial features achieves the highest performance in the person identification task, with an accuracy of 98.37%. In the verification task, however, concatenating facial features with the x-vector reaches an EER of 0.62%.
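
The feature-level fusion and the EER metric mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `fuse_features` and `equal_error_rate`, the embedding sizes, and the L2 normalization applied before concatenation are all assumptions made for illustration, and the EER is approximated by a simple threshold sweep over the observed scores.

```python
import numpy as np

def fuse_features(face_emb, voice_emb):
    """Feature-level fusion by concatenation.

    Assumption: each embedding is L2-normalized first so that neither
    modality dominates the fused vector. The abstract only states that
    the features are concatenated.
    """
    face = face_emb / np.linalg.norm(face_emb, axis=1, keepdims=True)
    voice = voice_emb / np.linalg.norm(voice_emb, axis=1, keepdims=True)
    return np.concatenate([face, voice], axis=1)

def equal_error_rate(genuine, impostor):
    """Approximate EER from genuine and impostor similarity scores.

    Sweeps every observed score as a threshold and returns the mean of
    the false acceptance and false rejection rates at the threshold
    where the two rates are closest.
    """
    thresholds = np.sort(np.concatenate([genuine, impostor]))
    far = np.array([(impostor >= t).mean() for t in thresholds])
    frr = np.array([(genuine < t).mean() for t in thresholds])
    i = np.argmin(np.abs(far - frr))
    return (far[i] + frr[i]) / 2

# Hypothetical embedding sizes: 8-dim face, 5-dim voice, 4 samples.
rng = np.random.default_rng(0)
fused = fuse_features(rng.normal(size=(4, 8)), rng.normal(size=(4, 5)))
print(fused.shape)
```

In a verification setting, `genuine` would hold similarity scores for same-person trial pairs and `impostor` the scores for different-person pairs, both computed on the fused vectors.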

Paper Structure

This paper contains 12 sections, 1 equation, 3 figures, 2 tables.

Introduction
Related Works
Multimodal Learning Strategies
Sensor Level Fusion
Feature Level Fusion
Score Level Fusion
Evaluation Setup
Experimental Results
Person Identification
Person Verification
Discussion
Conclusion

Figures (3)

Figure 1: Block diagram of the proposed systems: the single-modality VoiceNet and FaceNet, and three multimodal systems with different fusion strategies
Figure 2: Regions of face images that are crucial for decision-making in FaceNet
Figure 3: Regions of the gammatonegram image that the Darknet19 network treats as important for classification
