Liveness Detection in Computer Vision: Transformer-based Self-Supervised Learning for Face Anti-Spoofing

Arman Keresh, Pakizar Shamoi

TL;DR

This study highlights the superior performance of the transformer-based architecture in identifying complex spoofing cues, leading to significant advancements in biometric security.

Abstract

Face recognition systems are increasingly used in biometric security for their convenience and effectiveness. However, they remain vulnerable to spoofing attacks, in which attackers use photos, videos, or masks to impersonate legitimate users. This research addresses these vulnerabilities by exploring the Vision Transformer (ViT) architecture, fine-tuned with the DINO framework. DINO is a self-supervised learning framework that enables the model to learn distinguishing features from unlabeled data. We compared the proposed DINO-fine-tuned ViT model against a traditional CNN model, EfficientNet-B2, on the face anti-spoofing task. Experiments on standard datasets show that the ViT model outperforms the CNN model in both accuracy and resistance to different spoofing methods. Additionally, we collected our own dataset from a biometric application to further validate our findings. This study highlights the superior performance of the transformer-based architecture in identifying complex spoofing cues, leading to significant advancements in biometric security.

Paper Structure

This paper contains 13 sections, 4 equations, 8 figures, 4 tables, and 1 algorithm.

Figures (8)

  • Figure 1: Training data distribution by dataset and label
  • Figure 2: Validation data distribution by dataset and label
  • Figure 3: Sample images from the dataset, illustrating genuine (“live”) and spoofed (“fake”) examples. The dataset covers a variety of facial images and spoofing techniques, such as printed photographs and screen images.
  • Figure 4: The input face image is split into patches, which are then projected linearly and embedded with positional information. These embeddings go into the Transformer encoder, which processes the sequence of patches. The encoder's output is then passed through a multi-layer perceptron (MLP) head to classify the image as either "spoof" or "live."
  • Figure 5: This figure illustrates the DINO (self-distillation with no labels) training process. It starts with image augmentations (1), where two augmented views of the same image are generated. The student model processes one view, while the teacher model processes the other (2). The teacher model's outputs are centered and passed through a softmax layer (3). The student is trained with Stochastic Gradient Descent (SGD) to minimize the cross-entropy loss between its predictions and the teacher's, while the teacher's weights are updated as an exponential moving average (EMA) of the student's weights (4).
  • ...and 3 more figures
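
The patch pipeline described in the Figure 4 caption can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names `split_into_patches` and `embed_patches`, the single-channel image layout, and the toy weights are all assumptions made for the example, and the class token, Transformer encoder, and MLP head are omitted.

```python
def split_into_patches(image, patch_size):
    """Split an H x W image (list of rows) into flattened, non-overlapping patches.

    Patches are returned in row-major order, which is the order in which
    positional embeddings are later attached.
    """
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image dimensions must be divisible by patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patch = [
                image[row][col]
                for row in range(top, top + patch_size)
                for col in range(left, left + patch_size)
            ]
            patches.append(patch)
    return patches


def embed_patches(patches, projection, position_embeddings):
    """Project each flattened patch linearly, then add its positional embedding.

    projection: list of D weight vectors, each as long as a flattened patch
                (hypothetical toy weights; a real model learns these).
    position_embeddings: one D-dimensional vector per patch position.
    """
    tokens = []
    for patch, position in zip(patches, position_embeddings):
        projected = [sum(p * w for p, w in zip(patch, weights)) for weights in projection]
        tokens.append([x + e for x, e in zip(projected, position)])
    return tokens


# Toy 4 x 4 "image" with pixel values 0..15, split into 2 x 2 patches.
image = [[4 * row + col for col in range(4)] for row in range(4)]
patches = split_into_patches(image, patch_size=2)
tokens = embed_patches(
    patches,
    projection=[[1, 1, 1, 1]],            # D = 1: sums the pixels of each patch
    position_embeddings=[[0.5]] * len(patches),
)
```

The resulting `tokens` sequence is what a Transformer encoder would consume. The sequence length is fixed by the image size and patch size: an H x W image with patch size P yields (H / P) * (W / P) tokens.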
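
The training loop summarized in the Figure 5 caption rests on three small operations: centering and sharpening the teacher output, a cross-entropy loss on the student, and an EMA update of the teacher. The sketch below shows them in plain Python under stated assumptions: the helper names (`dino_loss`, `ema_update`, `update_center`) are hypothetical, parameters are flat lists of floats, and the temperatures and momenta are illustrative defaults, not values reported in this paper.

```python
import math


def softmax(logits, temperature):
    """Temperature-scaled softmax, shifted by the maximum for numerical stability."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def log_softmax(logits, temperature):
    """Temperature-scaled log-softmax computed via log-sum-exp."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    log_sum = peak + math.log(sum(math.exp(x - peak) for x in scaled))
    return [x - log_sum for x in scaled]


def dino_loss(student_logits, teacher_logits, center,
              student_temp=0.1, teacher_temp=0.04):
    """Cross-entropy between the centered, sharpened teacher and the student.

    The teacher distribution is treated as a fixed target: no gradient
    flows through it in a real training setup.
    """
    centered = [t - c for t, c in zip(teacher_logits, center)]
    teacher_probs = softmax(centered, teacher_temp)
    student_log_probs = log_softmax(student_logits, student_temp)
    return -sum(p * lq for p, lq in zip(teacher_probs, student_log_probs))


def ema_update(teacher_params, student_params, momentum=0.996):
    """Move each teacher parameter toward the corresponding student parameter."""
    return [momentum * t + (1.0 - momentum) * s
            for t, s in zip(teacher_params, student_params)]


def update_center(center, teacher_logits_batch, momentum=0.9):
    """Track a running mean of teacher outputs; subtracting it discourages collapse."""
    batch_mean = [sum(column) / len(column) for column in zip(*teacher_logits_batch)]
    return [momentum * c + (1.0 - momentum) * m for c, m in zip(center, batch_mean)]


# A student that agrees with the teacher incurs a much smaller loss.
center = [0.0, 0.0, 0.0]
loss_match = dino_loss([2.0, 0.0, 0.0], [2.0, 0.0, 0.0], center)
loss_mismatch = dino_loss([0.0, 2.0, 0.0], [2.0, 0.0, 0.0], center)
```

Only the student receives gradient updates; the teacher changes solely through `ema_update`, so it lags behind the student and provides a slowly moving target.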