
FaceLiVT: Face Recognition using Linear Vision Transformer with Structural Reparameterization For Mobile Device

Novendra Setyawan, Chi-Chia Sun, Mao-Hsiu Hsu, Wen-Kai Kuo, Jun-Wei Hsieh

TL;DR

FaceLiVT targets efficient real-time face recognition on mobile devices by marrying CNN features with Vision Transformer concepts through RepMix and Multi-Head Linear Attention. The paper introduces structural reparameterization to fuse BN with convolutions and MHLA to replace MHSA with linear complexity. Empirical results on LFW, CFP-FP, AgeDB-30, IJB-B, and IJB-C show state-of-the-art lightweight accuracy with substantially lower latency on mobile hardware (e.g., up to 8.6× faster than EdgeFace-XS0.6 and 21.2× faster than pure ViT). Ablation studies highlight the importance of reparameterization and MHLA head-count for balancing accuracy and speed. The work provides a practical path to real-time face recognition on resource-constrained platforms.
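The BN-into-convolution fusion mentioned above can be illustrated with a minimal sketch. It assumes a 1×1 (pointwise) convolution applied to a single pixel and inference-time BatchNorm with fixed running statistics; the function names and sample values are hypothetical and are not taken from the paper's code.

```python
import math

def pointwise_conv(x, weight, bias):
    """Apply a 1x1 convolution to one pixel; x holds C_in channel values."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weight, bias)]

def batch_norm(y, gamma, beta, mean, var, eps=1e-5):
    """Inference-time BatchNorm using fixed running statistics per channel."""
    return [g * (v - m) / math.sqrt(s + eps) + b
            for v, g, b, m, s in zip(y, gamma, beta, mean, var)]

def fuse_conv_bn(weight, bias, gamma, beta, mean, var, eps=1e-5):
    """Fold BatchNorm into the preceding convolution's weight and bias."""
    fused_weight, fused_bias = [], []
    for row, b, g, bt, m, s in zip(weight, bias, gamma, beta, mean, var):
        scale = g / math.sqrt(s + eps)            # per-output-channel scale
        fused_weight.append([w * scale for w in row])
        fused_bias.append((b - m) * scale + bt)
    return fused_weight, fused_bias

# Hypothetical values: 3 input channels, 2 output channels.
weight = [[0.2, -0.5, 1.0], [1.5, 0.3, -0.7]]
bias = [0.1, -0.2]
gamma, beta = [1.2, 0.8], [0.05, -0.1]
mean, var = [0.3, -0.4], [0.9, 1.6]
x = [0.5, -1.0, 2.0]

# Two ops (conv then BN) versus one fused conv.
reference = batch_norm(pointwise_conv(x, weight, bias), gamma, beta, mean, var)
fused_weight, fused_bias = fuse_conv_bn(weight, bias, gamma, beta, mean, var)
fused = pointwise_conv(x, fused_weight, fused_bias)
```

Both paths produce the same output up to floating-point rounding, but the fused path runs a single layer at inference time, which is the usual motivation for this kind of reparameterization.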

Abstract

This paper introduces FaceLiVT, a lightweight yet powerful face recognition model that integrates a hybrid Convolutional Neural Network (CNN)-Transformer architecture with an innovative and lightweight Multi-Head Linear Attention (MHLA) mechanism. By combining MHLA with a reparameterized token mixer, FaceLiVT reduces computational complexity while preserving competitive accuracy. Extensive evaluations on challenging benchmarks, including LFW, CFP-FP, AgeDB-30, IJB-B, and IJB-C, highlight its superior performance compared to state-of-the-art lightweight models. MHLA notably improves inference speed, allowing FaceLiVT to deliver high accuracy with lower latency on mobile devices. Specifically, FaceLiVT is 8.6× faster than EdgeFace, a recent hybrid CNN-Transformer model optimized for edge devices, and 21.2× faster than a pure ViT-based model. With its balanced design, FaceLiVT offers an efficient and practical solution for real-time face recognition on resource-constrained platforms.
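The abstract does not spell out how MHLA is computed. As a generic illustration only, and not the paper's formulation, the sketch below shows one common way attention cost becomes linear in the number of tokens N: when no softmax sits between the products, (QKᵀ)V can be regrouped as Q(KᵀV), which replaces an N×N intermediate with a d×d one. All names and values are hypothetical.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def transpose(m):
    """Swap rows and columns."""
    return [list(col) for col in zip(*m)]

def quadratic_order(q, k, v):
    """(Q K^T) V: the intermediate is N x N, so cost grows as N^2 * d."""
    return matmul(matmul(q, transpose(k)), v)

def linear_order(q, k, v):
    """Q (K^T V): the intermediate is d x d, so cost grows as N * d^2."""
    return matmul(q, matmul(transpose(k), v))

# Hypothetical input: N = 4 tokens, d = 2 channels.
q = [[1.0, 0.5], [0.0, 2.0], [1.5, -1.0], [0.25, 0.75]]
k = [[0.5, 1.0], [2.0, 0.0], [-1.0, 1.0], [0.5, 0.5]]
v = [[1.0, 2.0], [0.0, 1.0], [3.0, -1.0], [0.5, 0.5]]

out_quadratic = quadratic_order(q, k, v)
out_linear = linear_order(q, k, v)
```

The two orderings agree because matrix multiplication is associative. Softmax attention cannot be regrouped this way, which is why linear variants drop or replace the softmax.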

Paper Structure

This paper contains 14 sections, 10 equations, 1 figure, 4 tables.

Figures (1)

  • Figure 1: FaceLiVT architecture with Multi-Head Linear Attention (MHLA) and structural reparameterization. Stages 1 and 2 use RepMix as the token mixer, and the last stage uses MHLA. (a) FaceLiVT Block. (b) RepMix. (c) MHLA.