A Generative Framework for Self-Supervised Facial Representation Learning

Ruian He; Zhen Xing; Weimin Tan; Bo Yan

TL;DR

LatentFace introduces a 3D-aware latent diffusion framework for self-supervised facial representation learning. By first performing 3D latent autoencoding to disentangle texture, shape, pose, and lighting, and then applying a Representation Diffusion Model to separate time-invariant identity from expression in latent space, the approach achieves state-of-the-art unsupervised FER and face verification on RAF-DB, AffectNet, LFW, and SLLFW. Ablation studies show that both 3D factor disentangling and diffusion-based latent disentangling contribute robust gains, with shape-driven improvements for FER and texture-driven gains for verification. The method offers interpretable, controllable facial representations with strong practical impact while acknowledging potential privacy and bias concerns in face editing and deployment.

Abstract

Self-supervised representation learning has gained increasing attention for its strong generalization ability without relying on paired datasets. However, it has not been sufficiently explored for facial representation. Self-supervised facial representation learning remains unsolved due to the coupling of facial identity, expression, and external factors such as pose and light. Prior methods primarily focus on contrastive learning and pixel-level consistency, leading to limited interpretability and suboptimal performance. In this paper, we propose LatentFace, a novel generative framework for self-supervised facial representations. We suggest that the disentangling problem can also be formulated as generative objectives in space and time, and propose a solution using a 3D-aware latent diffusion model. First, we introduce a 3D-aware autoencoder to encode face images into 3D latent embeddings. Second, we propose a novel representation diffusion model to disentangle the 3D latent into facial identity and expression. Consequently, our method achieves state-of-the-art performance in facial expression recognition (FER) and face verification among self-supervised facial representation learning models. Our model achieves a 3.75\% advantage in FER accuracy on RAF-DB and 3.35\% on AffectNet compared to SOTA methods.

Paper Structure (29 sections, 10 equations, 14 figures, 5 tables)

Introduction
Methodology
Revisiting Facial Representation Learning
3D Latent Autoencoding
Latent Space Disentangling
Representation Diffusion Model
Experiments
Implementation Details
Architecture
Training Datasets
Training Procedure
Evaluation Settings
Baselines
Evaluation Protocol
Evaluation of Interpretable Representations
...and 14 more sections

Figures (14)

Figure 1: Comparison of different learning paradigms. Our generative framework enables better spatial-temporal awareness and more thorough representation disentangling than previous contrastive learning methods.
Figure 2: Overview of the proposed framework. In the first stage, we disentangle 3D factors, including texture, shape, pose, and light, through the training of autoencoders (comprising encoders $\mathcal{E}$ and decoders $\mathcal{D}$) and render them using a renderer $\mathcal{R}$ to reconstruct the input. In the second stage, we further disentangle the texture and shape latents and train the Representation Diffusion Model (RDM) $\mathcal{E}_{r}$ to generate the identity latent $Z_{id}$ from the emotional face latent $Z_{exp}$.
Figure 3: Visualization of generated facial identity. We show the reconstructed 3D face of 2 frames in the input video sequence. The facial identity is disentangled from expressions.
Figure 4: Comparison of disentangled representations. Our method produces more detailed, higher-fidelity representations than the SOTA method FaceCycle Chang_2021_ICCV.
Figure 5: Face frontalization results. (1) Output of the state-of-the-art method FaceCycle Chang_2021_ICCV. (2) Output of our model, which restores the complete frontal face with a lower ArcFace distance.
...and 9 more figures
