A Generative Framework for Self-Supervised Facial Representation Learning

Ruian He, Zhen Xing, Weimin Tan, Bo Yan

TL;DR

LatentFace introduces a 3D-aware latent diffusion framework for self-supervised facial representation learning. By first performing 3D latent autoencoding to disentangle texture, shape, pose, and lighting, and then applying a Representation Diffusion Model to separate time-invariant identity from expression in latent space, the approach achieves state-of-the-art unsupervised FER and face verification on RAF-DB, AffectNet, LFW, and SLLFW. Ablation studies show that both 3D factor disentangling and diffusion-based latent disentangling contribute robust gains, with shape-driven improvements for FER and texture-driven gains for verification. The method offers interpretable, controllable facial representations with strong practical impact while acknowledging potential privacy and bias concerns in face editing and deployment.

Abstract

Self-supervised representation learning has gained increasing attention for its strong generalization ability without relying on paired datasets. However, it has not been explored sufficiently for facial representation. Self-supervised facial representation learning remains unsolved due to the coupling of facial identities, expressions, and external factors like pose and light. Prior methods primarily focus on contrastive learning and pixel-level consistency, leading to limited interpretability and suboptimal performance. In this paper, we propose LatentFace, a novel generative framework for self-supervised facial representations. We suggest that the disentangling problem can also be formulated as generative objectives in space and time, and propose a solution using a 3D-aware latent diffusion model. First, we introduce a 3D-aware autoencoder to encode face images into 3D latent embeddings. Second, we propose a novel representation diffusion model to disentangle the 3D latents into facial identity and expression. Consequently, our method achieves state-of-the-art performance in facial expression recognition (FER) and face verification among self-supervised facial representation learning models. Our model achieves a 3.75% advantage in FER accuracy on RAF-DB and 3.35% on AffectNet compared to SOTA methods.

Paper Structure (29 sections, 10 equations, 14 figures, 5 tables)

Figures (14)

  • Figure 1: Comparison of different learning paradigms. Our generative framework enables better spatial-temporal awareness and more thorough representation disentangling than previous contrastive learning methods.
  • Figure 2: Overview of the proposed framework. In the first stage, we disentangle 3D factors, including texture, shape, pose, and light, by training autoencoders (comprising encoders $\mathcal{E}$ and decoders $\mathcal{D}$) and rendering the factors with a renderer $\mathcal{R}$ to reconstruct the input. In the second stage, we further disentangle the texture and shape latents and train the Representation Diffusion Model (RDM) $\mathcal{E}_{r}$ to generate the identity latent $Z_{id}$ from the emotional face latent $Z_{exp}$.
  • Figure 3: Visualization of generated facial identity. We show the reconstructed 3D face for two frames of the input video sequence. The facial identity is disentangled from expressions.
  • Figure 4: Comparison of disentangled representations. Our method produces more detailed, higher-fidelity representations than the SOTA method FaceCycle Chang_2021_ICCV.
  • Figure 5: Face frontalization results. (1) Output of the state-of-the-art method FaceCycle Chang_2021_ICCV. (2) Output of our model, which restores the complete frontal face with a lower ArcFace distance.
  • ...and 9 more figures