One-shot lip-based biometric authentication: extending behavioral features with authentication phrase information

Brando Koch; Ratko Grbić

TL;DR

This work addresses the vulnerability of lip-based biometric authentication (LBBA) to video replay attacks by introducing phrase-aware one-shot learning. It trains a Siamese network with a LipNet-inspired backbone on a customized GRID dataset whose triplets encode both who the speaker is and what phrase is spoken, using a batch-wise hard-negative triplet loss to learn a discriminative embedding. The approach achieves 3.2% FAR and 3.8% FRR on the test set and demonstrates that behavioral features contribute more when phrases differ, while showing resilience against replay attacks. The dataset construction, network design, and loss formulation offer a practical path to more secure LBBA systems in real-world scenarios.
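To make the idea of a batch-wise hard-negative triplet loss concrete, here is a minimal sketch of the generic "hardest negative in the batch" formulation. It is not the authors' modified loss, whose exact form is not given in this summary. The function names, the Euclidean distance, the margin value, and the toy embeddings are all assumptions made for illustration. The one detail taken from the text is the labelling: a sample counts as a positive only if both speaker and phrase match, so the same speaker saying a different phrase is treated as a negative.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def batch_hard_negative_triplet_loss(embeddings, labels, margin=0.5):
    """Generic batch-hard-negative triplet loss (illustrative, not the paper's).

    embeddings: list of vectors, one per video clip in the batch.
    labels: list of (speaker, phrase) tuples; two clips are a positive
        pair only if the whole tuple matches (assumption based on the
        phrase-aware setup described above).
    margin: hypothetical margin value.
    """
    losses = []
    n = len(embeddings)
    for i in range(n):
        # Distances from the anchor to every negative in the batch.
        neg_dists = [
            euclidean(embeddings[i], embeddings[k])
            for k in range(n)
            if labels[k] != labels[i]
        ]
        if not neg_dists:
            continue  # no negative available for this anchor
        hardest_neg = min(neg_dists)  # closest negative is the hardest one
        for j in range(n):
            if j != i and labels[j] == labels[i]:
                d_pos = euclidean(embeddings[i], embeddings[j])
                losses.append(max(0.0, d_pos - hardest_neg + margin))
    return sum(losses) / len(losses) if losses else 0.0


# Toy batch (invented values): two clips of speaker s1 saying p1,
# the same speaker saying p2, and a different speaker saying p1.
labels = [("s1", "p1"), ("s1", "p1"), ("s1", "p2"), ("s2", "p1")]
easy = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [5.0, 0.0]]
hard = [[0.0, 0.0], [1.0, 0.0], [0.0, 0.5], [5.0, 0.0]]
easy_loss = batch_hard_negative_triplet_loss(easy, labels)
hard_loss = batch_hard_negative_triplet_loss(hard, labels)
```

In the `hard` batch the same-speaker, different-phrase clip sits closer to the anchor than the true positive, so the loss becomes non-zero. That is the situation a replay attack with the wrong phrase would create.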

Abstract

Lip-based biometric authentication (LBBA) is an authentication method based on a person's lip movements during speech, recorded as video by a camera sensor. LBBA can utilize both physical and behavioral characteristics of lip movements without requiring any sensory equipment apart from an RGB camera. State-of-the-art (SOTA) approaches use one-shot learning to train deep Siamese neural networks that produce an embedding vector from these features. The embeddings are then used to compute the similarity between an enrolled user and a user being authenticated. A flaw of these approaches is that they model behavioral features as style-of-speech without relation to what is being said. This makes the system vulnerable to video replay attacks in which the client is shown speaking any phrase. To solve this problem, we propose a one-shot approach that models behavioral features so that they discriminate on what is being said in addition to style-of-speech. We achieve this by customizing the GRID dataset to obtain the required triplets and training a Siamese neural network based on 3D convolutions and recurrent neural network layers. A custom triplet loss for batch-wise hard-negative mining is proposed. Using an open-set protocol, the approach obtains 3.2% FAR and 3.8% FRR on the test set of the customized GRID dataset. Additional analysis of the results quantifies the influence and discriminatory power of behavioral and physical features for LBBA.
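The two reported metrics can be illustrated with a short sketch. FAR (false acceptance rate) is the fraction of impostor attempts that are accepted, and FRR (false rejection rate) is the fraction of genuine attempts that are rejected. The function name, the "accept if score is at or above the threshold" convention, the threshold, and the scores below are assumptions for illustration, not values or code from the paper. In the phrase-aware setting described above, a genuine pair would be the same speaker saying the same phrase, and every other pair would count as an impostor.

```python
def far_frr(scores, is_genuine, threshold):
    """Compute (FAR, FRR) for similarity scores at a given threshold.

    scores: similarity per pair, higher means more similar (assumption).
    is_genuine: True where the pair is same speaker and same phrase.
    threshold: a pair is accepted when score >= threshold.
    """
    genuine = [s for s, g in zip(scores, is_genuine) if g]
    impostor = [s for s, g in zip(scores, is_genuine) if not g]
    if not genuine or not impostor:
        raise ValueError("need at least one genuine and one impostor pair")
    far = sum(s >= threshold for s in impostor) / len(impostor)
    frr = sum(s < threshold for s in genuine) / len(genuine)
    return far, frr


# Invented toy scores: three genuine pairs followed by three impostor pairs.
scores = [0.9, 0.8, 0.4, 0.7, 0.2, 0.1]
is_genuine = [True, True, True, False, False, False]
far, frr = far_frr(scores, is_genuine, threshold=0.5)
```

Raising the threshold lowers FAR at the cost of a higher FRR, which is why the two numbers are reported together.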

Paper Structure (12 sections, 3 equations, 10 figures, 7 tables)

Introduction
Related work
Proposed approach to one-shot learning for LBBA
Customized GRID dataset
Dataset preprocessing
Model architecture
Hard-negative mining and the modified triplet loss function
Results and discussion
Experimental setup
Results
Conclusion
Distributions of test set pairs prediction scores

Figures (10)

Figure 1: Video frame examples from GRID dataset.
Figure 2: An example of the Face Mesh landmark mask superimposed on a frame of a speaking face from the GRID dataset.
Figure 3: An example of an image of the mouth region, which has been obtained by cropping a speaker's face image.
Figure 4: Main results on the test set.
Figure 5: Stacked histograms of test set pair prediction scores, with respect to pair type.
...and 5 more figures
