WhisperNetV2: SlowFast Siamese Network For Lip-Based Biometrics

Abdollah Zakeri; Hamid Hassanpour; Mohammad Hossein Khosravi; Amir Masoud Nourollah

TL;DR

This work introduces WhisperNetV2, a Siamese architecture that uses SlowFast embedding networks and a triplet loss for lip-based biometric authentication (LBBA). A fast pathway captures lip motion and a slow pathway captures static appearance, so the model uses both behavioral and physiological lip features and aims to be invariant to the client's emotional state during video capture. Trained under an open-set protocol on CREMA-D videos cropped to the lip region, WhisperNetV2 achieves an Equal Error Rate (EER) of $0.005$ on unseen subjects, outperforming previous LBBA methods while reducing model size and avoiding dependence on lip landmarks. The paper also discusses limitations and plans to use larger-scale datasets to improve robustness and generalization in real-world scenarios.

Abstract

Lip-based biometric authentication (LBBA) has attracted many researchers during the last decade. The lip is of particular interest to biometric researchers because it is a twin biometric, able to function as both a physiological and a behavioral trait. Although much valuable research has been conducted on LBBA, none of it has considered the client's emotional state during video acquisition, which can affect facial expressions and speech tempo. We propose a novel network structure called WhisperNetV2, which extends our previously proposed WhisperNet. The network uses a deep Siamese structure trained with triplet loss, with three identical SlowFast networks as embedding networks. The SlowFast network is well suited to this task: the fast pathway extracts motion-related features (behavioral lip movements) at a high frame rate with low channel capacity, while the slow pathway extracts visual features (physiological lip appearance) at a low frame rate with high channel capacity. Using an open-set protocol, we trained the network on the CREMA-D dataset and obtained an Equal Error Rate (EER) of 0.005 on the test set. Because this EER is lower than that of most comparable LBBA methods, our method can be considered state of the art for LBBA.
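
As a rough illustration of the two quantities the abstract relies on, the sketch below computes a standard triplet margin loss on embedding vectors and an approximate EER from distance scores. It is not the authors' implementation: the function names, the Euclidean distance, the margin value of 1.0, and the threshold sweep are all assumptions made for illustration, and the embeddings here are plain lists standing in for the outputs of the SlowFast embedding networks.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Standard triplet margin loss: max(0, d(a, p) - d(a, n) + margin).

    The margin of 1.0 is a placeholder, not a value taken from the paper.
    """
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


def equal_error_rate(genuine, impostor):
    """Approximate EER by sweeping a distance threshold.

    genuine:  distances between same-identity pairs
    impostor: distances between different-identity pairs
    A pair is accepted when its distance is <= threshold. The EER is
    taken at the threshold where the false acceptance rate (FAR) and
    false rejection rate (FRR) are closest.
    """
    best_gap, eer = float("inf"), 1.0
    for t in sorted(set(genuine) | set(impostor)):
        far = sum(d <= t for d in impostor) / len(impostor)
        frr = sum(d > t for d in genuine) / len(genuine)
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return eer


if __name__ == "__main__":
    # Toy embeddings: the positive is close to the anchor, the negative is far.
    print(triplet_loss([0.0, 0.0], [0.1, 0.0], [3.0, 4.0]))
    # Toy distance scores with one overlapping genuine/impostor pair.
    print(equal_error_rate([0.1, 0.2, 0.3, 0.9], [0.25, 0.8, 1.0, 1.1]))
```

In an open-set evaluation such as the one described above, the genuine and impostor distances would come from subjects that were not seen during training.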

Paper Structure (6 sections, 3 equations, 7 figures, 2 tables)

Introduction
Literature Review
Dataset and pre-processing
Proposed Method
Experiments and Results
Conclusion and Future Works

Figures (7)

Figure 1: Flowchart of our proposed method for pre-processing and visual speaker authentication
Figure 2: Structure of a Siamese Network
Figure 3: Slow-Fast Network Architecture
Figure 4: Training loss curve for our proposed network. Due to fluctuations in the loss values, the moving average for loss values is plotted.
Figure 5: Training loss curve for our proposed network. Due to fluctuations in the loss values, the moving average for loss values is plotted.
...and 2 more figures
