Face Pyramid Vision Transformer

Khawar Islam; Muhammad Zaigham Zaheer; Arif Mahmood

TL;DR

The paper addresses efficient, high-accuracy face recognition and verification by introducing the Face Pyramid Vision Transformer (FPVT), a four-stage pyramid transformer that learns multi-scale facial representations. FPVT combines four components: an Improved Patch Embedding (IPE) that models local-to-global facial structure, a Convolutional Feed-Forward Network (CFFN) that captures locality, a lightweight Face Spatial Reduction Attention (FSRA) that reduces computation, and a Face Dimensionality Reduction (FDR) layer that keeps features compact. Results on seven datasets show that FPVT achieves competitive or superior accuracy with fewer parameters than state-of-the-art CNNs, pure ViTs, and convolutional ViTs, and ablations confirm the contributions of IPE, CFFN, FSRA, and FDR. The approach makes face recognition and verification practical in resource-constrained settings and real-world deployment while generalizing well across pose, age, and expression variations.

Abstract

A novel Face Pyramid Vision Transformer (FPVT) is proposed to learn discriminative multi-scale facial representations for face recognition and verification. In FPVT, Face Spatial Reduction Attention (FSRA) and Face Dimensionality Reduction (FDR) layers are employed to make the feature maps compact, thus reducing computation. An Improved Patch Embedding (IPE) algorithm is proposed to exploit the benefits of CNNs in ViTs (e.g., shared weights, local context, and receptive fields) and to model features ranging from lower-level edges to higher-level semantic primitives. Within the FPVT framework, a Convolutional Feed-Forward Network (CFFN) is proposed that extracts locality information to learn low-level facial information. The proposed FPVT is evaluated on seven benchmark datasets and compared with ten existing state-of-the-art methods, including CNNs, pure ViTs, and convolutional ViTs. Despite having fewer parameters, FPVT demonstrates excellent performance compared with these methods. The project page is available at https://khawar-islam.github.io/fpvt/
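To make the claim about compact feature maps concrete, the sketch below counts query-key pairs in a four-stage pyramid when keys and values come from a spatially downsampled grid, which is the general idea behind spatial reduction attention in pyramid transformers. The abstract does not state FPVT's input size, stage strides, or reduction ratios, so `IMAGE_SIZE`, `STAGE_STRIDES`, and `STAGE_REDUCTIONS` are hypothetical values chosen only so the arithmetic works out cleanly, and the function names are illustrative rather than part of any released code.

```python
def attention_pairs(height, width, reduction=1):
    """Count query-key pairs for one attention head on a height x width token grid.

    Every token acts as a query. Keys/values are assumed to come from a grid
    downsampled by `reduction` along each spatial axis (reduction=1 is plain
    full self-attention).
    """
    queries = height * width
    keys = (height // reduction) * (width // reduction)
    return queries * keys


# Hypothetical settings for illustration only (not taken from the paper).
IMAGE_SIZE = 224
STAGE_STRIDES = (4, 8, 16, 32)     # assumed downsampling of each pyramid stage
STAGE_REDUCTIONS = (8, 4, 2, 1)    # assumed key/value reduction per stage


def pyramid_costs(image_size=IMAGE_SIZE,
                  strides=STAGE_STRIDES,
                  reductions=STAGE_REDUCTIONS):
    """Return (grid side, full-attention pairs, reduced-attention pairs) per stage."""
    rows = []
    for stride, reduction in zip(strides, reductions):
        side = image_size // stride
        full = attention_pairs(side, side)
        reduced = attention_pairs(side, side, reduction)
        rows.append((side, full, reduced))
    return rows


if __name__ == "__main__":
    for stage, (side, full, reduced) in enumerate(pyramid_costs(), start=1):
        print(f"stage {stage}: {side}x{side} tokens, "
              f"full={full}, reduced={reduced}, saving={full // reduced}x")
```

Under these assumed settings the saving is largest in the early, high-resolution stages, where full attention would otherwise be most expensive; the last stage uses a reduction of 1 and is unchanged.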
