TransFace++: Rethinking the Face Recognition Paradigm with a Focus on Accuracy, Efficiency, and Security

Jun Dan, Yang Liu, Baigui Sun, Jiankang Deng, Shan Luo

TL;DR

This paper addresses three face recognition (FR) challenges: (1) CNNs' limited global feature modeling, (2) RGB decoding bottlenecks that harm efficiency, and (3) privacy risks from raw RGB inputs. It introduces TransFace, a ViT-based FR backbone that uses a patch-level augmentation strategy (DPAP) and an entropy-driven hard sample mining strategy (EHSM) to improve accuracy and robustness, and TransFace++, a privacy-preserving variant that operates directly on image bytes using Topology-based Image Bytes Compression (TIBC) and Structure Information-guided Cross-Attention (SICA). TransFace achieves competitive or superior performance relative to RGB-based and ViT baselines on major benchmarks, while TransFace++ delivers strong FR accuracy from encrypted bytes and shows potential for privacy-preserving deployment. Together, these methods target accuracy, efficiency, and security, and open avenues for byte-based FR pipelines and privacy-preserving architectures. The optimization objectives include $\mathcal{L}_{cls}^{trans}$ for learning from RGB patches and $\mathcal{L}_{cls}^{byte}$ for learning from image bytes.

Abstract

Face Recognition (FR) technology has made significant strides with the emergence of deep learning. Most existing FR models are built on Convolutional Neural Networks (CNNs) and take RGB face images as input. In this work, we take a closer look at existing FR paradigms from the perspectives of efficiency, security, and precision, and identify three problems: (i) CNN frameworks struggle to capture global facial features and to model the correlations between local facial features. (ii) Using RGB face images as the model's input greatly degrades inference efficiency and adds extra computation cost. (iii) In a real-world FR system that operates on RGB face images, user privacy may be compromised if attackers penetrate the system and gain access to the model's input. To address these three issues, we propose two novel FR frameworks, TransFace and TransFace++, which explore the feasibility of applying ViTs and image bytes to FR tasks, respectively. Experiments on popular face benchmarks demonstrate the superiority of TransFace and TransFace++. Code is available at https://github.com/DanJun6737/TransFace_pp.

Paper Structure (27 sections, 1 theorem, 20 equations, 11 figures, 10 tables)

Key Result

Theorem 1

Among all continuous distributions $\mathbb{D}(a)$ with mean $\mu$ and variance $\sigma^{2}$, the differential entropy is maximized when $\mathbb{D}(a)$ is the Gaussian distribution $\mathcal{N}(\mu,\sigma^{2})$.
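
As a quick sanity check on Theorem 1 (not a proof, and not taken from the paper), the sketch below compares the closed-form differential entropies, in nats, of three distributions that share the same variance $\sigma^{2}$. The choice of the uniform and Laplace distributions as comparison points and all function names are assumptions made for illustration.

```python
import math


def gaussian_entropy(sigma: float) -> float:
    """Differential entropy of N(mu, sigma^2): 0.5 * ln(2 * pi * e * sigma^2)."""
    return 0.5 * math.log(2.0 * math.pi * math.e * sigma ** 2)


def uniform_entropy(sigma: float) -> float:
    """Uniform on an interval of width w has variance w^2 / 12 and entropy ln(w)."""
    width = sigma * math.sqrt(12.0)
    return math.log(width)


def laplace_entropy(sigma: float) -> float:
    """Laplace with scale b has variance 2 * b^2 and entropy 1 + ln(2 * b)."""
    b = sigma / math.sqrt(2.0)
    return 1.0 + math.log(2.0 * b)


if __name__ == "__main__":
    # The mean does not affect differential entropy, so only sigma is varied.
    for sigma in (0.5, 1.0, 2.0):
        print(
            f"sigma={sigma}: "
            f"gaussian={gaussian_entropy(sigma):.4f}, "
            f"laplace={laplace_entropy(sigma):.4f}, "
            f"uniform={uniform_entropy(sigma):.4f}"
        )
```

For every value of `sigma`, the Gaussian entropy is the largest of the three, which is consistent with the theorem. All three expressions have the form $\ln\sigma$ plus a constant, so the gaps between them do not depend on $\sigma$.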

Figures (11)

  • Figure 1: (a): Existing inference paradigms of FR models primarily rely on RGB face images, which poses a risk of privacy leakage for users. (b): Our TransFace++ framework is able to directly operate on encrypted image bytes without reconstructing RGB images, greatly protecting user privacy.
  • Figure 2: Top: Previous data augmentation approaches may destroy the fidelity and structural information of face identity when augmenting samples. Our DPAP strategy not only constructs diverse samples but also effectively preserves the key information of the face. Bottom: Existing hard sample mining methods usually adopt several instance-level indicators to measure sample difficulty, which is suboptimal for ViTs. Our EHSM strategy leverages information entropy from all local tokens to mine hard samples.
  • Figure 3: Global overview of the proposed TransFace model. To alleviate the overfitting problem in ViTs, the DPAP strategy employs the SE module to screen out the top-$K_{0}$ dominant patches, and then randomly perturbs their amplitude information to expand sample diversity. Furthermore, to effectively mine hard samples and enhance the feature representation power of local tokens, the EHSM strategy utilizes an entropy-aware weight mechanism to re-weight the classification loss. $n$ is the total number of patches, and $\bigotimes$ denotes the multiplication operation between the local token and the scaling factor generated by the SE module. The image patches with red boxes represent dominant patches.
  • Figure 4: Demonstration of phase-only reconstructed face image and amplitude-only reconstructed face image. Upper row: Face image is reconstructed using only the phase information by setting the amplitude information to a constant. Bottom row: Face image is reconstructed using only the amplitude information by making the phase component constant.
  • Figure 5: Example images and corresponding information entropy. Samples labeled with the same ID are displayed in each column. First row: Easy samples usually contain richer information (i.e., larger information entropy). Second row: Hard samples usually contain less information (i.e., lower information entropy).
  • ...and 6 more figures
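
The amplitude/phase decomposition described in the Figure 4 caption, and the amplitude perturbation mentioned in the Figure 3 caption, can be sketched with a 2D Fourier transform. This is a minimal NumPy sketch, not the authors' implementation: the function names, the constant amplitude of 1, the constant phase of 0, and the uniform multiplicative noise are all assumptions chosen for illustration, since the captions do not specify these details.

```python
import numpy as np


def split_amplitude_phase(patch):
    """Return (amplitude, phase) of the 2D real FFT of a grayscale patch."""
    spectrum = np.fft.rfft2(patch)
    return np.abs(spectrum), np.angle(spectrum)


def recombine(amplitude, phase, shape):
    """Rebuild a real-valued patch of the given shape from amplitude and phase."""
    spectrum = amplitude * np.exp(1j * phase)
    return np.fft.irfft2(spectrum, s=shape)


def phase_only(patch):
    """Keep the phase and set the amplitude to a constant (assumed: 1)."""
    amplitude, phase = split_amplitude_phase(patch)
    return recombine(np.ones_like(amplitude), phase, patch.shape)


def amplitude_only(patch):
    """Keep the amplitude and set the phase to a constant (assumed: 0)."""
    amplitude, _ = split_amplitude_phase(patch)
    return recombine(amplitude, np.zeros_like(amplitude), patch.shape)


def perturb_amplitude(patch, rng, scale=0.1):
    """Scale each amplitude bin by a random factor in [1 - scale, 1 + scale].

    The phase array is passed through unchanged. The noise model is a
    hypothetical stand-in; the captions only say the amplitude is
    randomly perturbed. Use scale < 1 so the factors stay positive.
    """
    amplitude, phase = split_amplitude_phase(patch)
    factor = 1.0 + scale * rng.uniform(-1.0, 1.0, size=amplitude.shape)
    return recombine(amplitude * factor, phase, patch.shape)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    patch = rng.random((8, 8))  # stand-in for one grayscale image patch
    augmented = perturb_amplitude(patch, rng, scale=0.1)
    print(augmented.shape)
```

In a patch-level pipeline, `perturb_amplitude` would be applied only to the patches selected as dominant; the selection step (the SE module in Figure 3) is outside the scope of this sketch.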

Theorems & Definitions (1)

  • Theorem 1