WhisperNetV2: SlowFast Siamese Network For Lip-Based Biometrics

Abdollah Zakeri, Hamid Hassanpour, Mohammad Hossein Khosravi, Amir Masoud Nourollah

TL;DR

This work introduces WhisperNetV2, a SlowFast Siamese architecture with triplet loss for lip-based biometric authentication (LBBA). By separating a fast pathway for lip motion from a slow pathway for static appearance, the model captures both behavioral and physiological lip features and aims to be invariant to variations in the client's emotion during video capture. Trained under an open-set protocol on CREMA-D videos cropped to the lip region, WhisperNetV2 achieves an Equal Error Rate of $0.005$ on unseen subjects, lower than most previous LBBA methods, while reducing model size and avoiding lip-landmark dependencies. The paper also discusses limitations and plans to use larger-scale datasets to improve robustness and generalization in real-world scenarios.
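To make the triplet-loss idea concrete, the following is a minimal sketch in plain NumPy, not the authors' implementation. It assumes squared Euclidean distance between embeddings and a margin of 1.0; both choices, as well as the function name `triplet_loss` and the toy 2-D embeddings, are illustrative assumptions rather than values taken from the paper.

```python
import numpy as np


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Mean hinge-style triplet loss over a batch of embeddings (batch x dim).

    Assumptions (not from the paper): squared Euclidean distance, margin=1.0.
    """
    d_pos = np.sum((anchor - positive) ** 2, axis=1)  # anchor vs. same identity
    d_neg = np.sum((anchor - negative) ** 2, axis=1)  # anchor vs. other identity
    # Loss is zero once the negative is farther than the positive by `margin`.
    return float(np.mean(np.maximum(d_pos - d_neg + margin, 0.0)))


# Toy embeddings standing in for the outputs of the three shared-weight branches.
anchor = np.array([[0.0, 0.0]])
positive = np.array([[0.1, 0.0]])
negative = np.array([[2.0, 0.0]])

easy = triplet_loss(anchor, positive, negative)  # negative is already far away
hard = triplet_loss(anchor, negative, positive)  # roles swapped: large penalty
```

In a Siamese setup, the three inputs would come from the same embedding network applied to an anchor clip, a clip of the same speaker, and a clip of a different speaker; here they are hard-coded so the behavior of the loss is easy to inspect.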

Abstract

Lip-based biometric authentication (LBBA) has attracted many researchers during the last decade. The lip is of particular interest to biometric researchers because it is a twin biometric that can function as both a physiological and a behavioral trait. Although much valuable research has been conducted on LBBA, none of it has considered the client's emotional state during the video acquisition step, which can affect the client's facial expressions and speech tempo. We propose a novel network structure called WhisperNetV2, which extends our previously proposed network, WhisperNet. The proposed network uses a deep Siamese structure with triplet loss, with three identical SlowFast networks as embedding networks. The SlowFast network is well suited to this task: the fast pathway extracts motion-related features (behavioral lip movements) at a high frame rate with low channel capacity, while the slow pathway extracts visual features (physiological lip appearance) at a low frame rate with high channel capacity. Using an open-set protocol, we trained our network on the CREMA-D dataset and obtained an Equal Error Rate (EER) of 0.005 on the test set. Because this EER is lower than that of most similar LBBA methods, our method can be considered a state-of-the-art LBBA method.
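The Equal Error Rate reported above is the operating point at which the false acceptance rate (FAR) equals the false rejection rate (FRR). The sketch below shows one common way to approximate it from verification scores; it is an illustration under stated assumptions, not the evaluation code used in the paper. It assumes scores are embedding distances (lower means more similar), and the function name `equal_error_rate` and the synthetic score lists are hypothetical.

```python
import numpy as np


def equal_error_rate(genuine, impostor):
    """Approximate EER from distance scores (lower = more similar).

    genuine:  distances for same-identity pairs
    impostor: distances for different-identity pairs
    """
    genuine = np.asarray(genuine, dtype=float)
    impostor = np.asarray(impostor, dtype=float)
    # Every observed score is a candidate decision threshold.
    thresholds = np.sort(np.concatenate([genuine, impostor]))
    # FRR: fraction of genuine pairs rejected (distance above the threshold).
    frr = np.array([(genuine > t).mean() for t in thresholds])
    # FAR: fraction of impostor pairs accepted (distance at or below the threshold).
    far = np.array([(impostor <= t).mean() for t in thresholds])
    # Take the threshold where the two error rates are closest and average them.
    i = int(np.argmin(np.abs(far - frr)))
    return float((far[i] + frr[i]) / 2.0)


# Synthetic distances for illustration only; these are not results from the paper.
genuine_scores = [0.1, 0.2, 0.3, 0.9]
impostor_scores = [0.8, 1.0, 1.1, 1.2]
eer = equal_error_rate(genuine_scores, impostor_scores)
```

With a finite set of scores the FAR and FRR curves rarely cross exactly, so averaging the two rates at the closest threshold is a simple approximation; interpolating between neighboring thresholds is a common refinement.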

Paper Structure (6 sections, 3 equations, 7 figures, 2 tables)

This paper contains 6 sections, 3 equations, 7 figures, 2 tables.

Figures (7)

  • Figure 1: Flowchart of our proposed method for pre-processing and visual speaker authentication
  • Figure 2: Structure of a Siamese Network
  • Figure 3: Slow-Fast Network Architecture
  • Figure 4: Training loss curve for our proposed network. Due to fluctuations in the loss values, the moving average for loss values is plotted.
  • Figure 5: Training loss curve for our proposed network. Due to fluctuations in the loss values, the moving average for loss values is plotted.
  • ...and 2 more figures