FaceGPT: Self-supervised Learning to Chat about 3D Human Faces

Haoran Wang, Mohit Mendiratta, Christian Theobalt, Adam Kortylewski

TL;DR

FaceGPT introduces a self-supervised framework that endows a Large Vision-Language Model with 3D face reasoning by embedding 3DMM parameters $\theta=[\alpha, \delta, \gamma, \phi, c]$ into the token space through a <FACE> token and a differentiable renderer. It trains as a model-based autoencoder from in-the-wild images, using a self-supervised loss that enforces 3DMM regression via image reconstruction and a text-prediction objective to preserve instruction-following, all without 3D annotations. The approach supports both image-to-3DMM and text-to-3DMM generation, as well as multi-modal instruction following, enabling 3D face reasoning in a generalist MM-LLM. Experiments show competitive monocular 3D face reconstruction against specialized methods and strong text-driven 3D face generation, illustrating a viable path to integrating detailed 3D world knowledge with broad language capabilities in a fully self-supervised regime.

Abstract

We introduce FaceGPT, a self-supervised learning framework for Large Vision-Language Models (VLMs) to reason about 3D human faces from images and text. Typical 3D face reconstruction methods are specialized algorithms that lack semantic reasoning capabilities. FaceGPT overcomes this limitation by embedding the parameters of a 3D morphable face model (3DMM) into the token space of a VLM, enabling the generation of 3D faces from both textual and visual inputs. FaceGPT is trained in a self-supervised manner as a model-based autoencoder from in-the-wild images. In particular, the hidden state of the LLM is projected into 3DMM parameters and subsequently rendered as a 2D face image to guide the self-supervised learning process via image-based reconstruction. Without relying on expensive 3D annotations of human faces, FaceGPT obtains a detailed understanding of 3D human faces while preserving the capacity to understand general user instructions. Our experiments demonstrate that FaceGPT not only achieves high-quality 3D face reconstructions but also retains the ability for general-purpose visual instruction following. Furthermore, FaceGPT learns, in a fully self-supervised manner, to generate 3D faces from complex textual inputs, which opens a new direction in human face analysis.
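
The projection step described in the abstract can be illustrated with a minimal sketch: the hidden state at the <FACE> token is mapped to a flat vector and sliced into the groups $\theta=[\alpha, \delta, \gamma, \phi, c]$. Everything concrete here is an assumption made for illustration and is not taken from the paper: the group sizes in `PARAM_DIMS`, the toy hidden size, the use of a single affine map for the projection, and the function names.

```python
import random

# Hypothetical sizes for each 3DMM parameter group; the source does not state them.
PARAM_DIMS = {"alpha": 80, "delta": 64, "gamma": 80, "phi": 27, "c": 6}
TOTAL_DIM = sum(PARAM_DIMS.values())


def linear(hidden, weight, bias):
    """Affine map: out[j] = sum_i hidden[i] * weight[i][j] + bias[j]."""
    return [
        sum(h * row[j] for h, row in zip(hidden, weight)) + bias[j]
        for j in range(len(bias))
    ]


def split_theta(theta):
    """Slice a flat vector into the named groups alpha, delta, gamma, phi, c."""
    params, start = {}, 0
    for name, dim in PARAM_DIMS.items():
        params[name] = theta[start:start + dim]
        start += dim
    return params


def project_face_token(hidden, weight, bias):
    """Project the <FACE> hidden state (assumed single affine layer) and split it."""
    return split_theta(linear(hidden, weight, bias))


random.seed(0)
HIDDEN_DIM = 32  # toy size for the example; a real LLM hidden state is far larger
hidden = [random.gauss(0, 1) for _ in range(HIDDEN_DIM)]
weight = [[random.gauss(0, 0.01) for _ in range(TOTAL_DIM)] for _ in range(HIDDEN_DIM)]
bias = [0.0] * TOTAL_DIM

params = project_face_token(hidden, weight, bias)
print({name: len(values) for name, values in params.items()})
```

In the actual system the resulting parameters would be passed to the 3DMM and a differentiable renderer; that part is omitted here because it depends on the face model's basis and the renderer in use.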

Paper Structure (17 sections, 5 equations, 4 figures, 4 tables)

This paper contains 17 sections, 5 equations, 4 figures, 4 tables.

Figures (4)

  • Figure 1: We introduce FaceGPT, a large vision-language model that learns to produce 3D human faces (in terms of 3DMM parameters) in a fully self-supervised manner. When prompted with face images and task-specific questions, FaceGPT can output a special <FACE> token whose feature embedding (red) can be decoded into 3DMM parameters that encode the face shape $\alpha$, expression $\delta$, texture $\gamma$, light $\phi$, and camera $c$. Decoding these parameters with a 3DMM and a differentiable renderer enables fully self-supervised learning via inverse rendering. FaceGPT can produce 3D human faces from text-only input (first row) as well as multi-modal input (second row) without requiring any 3D annotations, while retaining general chatting abilities (last row).
  • Figure 2: Architecture of FaceGPT. Our model consists of a multi-modal LLM, which includes a vision encoder, a vision projection layer, and the LLM itself, along with a 3DMM projection layer, denoted as $\sigma$, and the parametric Basel face model (Blanz and Vetter, 1999). During training, we train the $\sigma$ projection layer and fine-tune the LLM while keeping the other components frozen. The network components are trained using a differentiable renderer to compute a self-supervised loss.
  • Figure 3: Qualitative results for 3D face reconstruction. Our approach allows for the regression of pose, shape, expression, skin reflectance, and illumination from a single monocular image, achieving quality comparable to recent state-of-the-art methods.
  • Figure 4: Qualitative results of text-to-3DMM synthesis comparing the baseline VLM LLaVA-Key with FaceGPT. Our method can reason about and generate faithful 3D faces using only a textual description. For reference, we also show a realistic face image that matches the description, together with a 3DMM reconstruction of that image estimated by the FOCUS method (Li et al., 2023). Our approach captures many facial details from the text and produces much better visual results than the LLaVA-Key baseline.
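
The training signal mentioned in the Figure 2 caption and the TL;DR combines an image-reconstruction term with a text-prediction term. The sketch below shows one plausible shape for such an objective. It is an illustration under stated assumptions, not the paper's loss: the masked squared-error form, the negative log-likelihood form, the weights `w_recon` and `w_text`, and all function names are hypothetical, and images are flattened to plain lists of pixel intensities.

```python
import math


def photometric_loss(rendered, target, mask):
    """Mean squared pixel error over pixels where mask is 1 (assumed loss form)."""
    covered = sum(mask)
    if covered == 0:
        return 0.0
    squared = sum(m * (r - t) ** 2 for r, t, m in zip(rendered, target, mask))
    return squared / covered


def text_loss(token_probs):
    """Mean negative log-likelihood of the ground-truth answer tokens."""
    return -sum(math.log(p) for p in token_probs) / len(token_probs)


def total_loss(rendered, target, mask, token_probs, w_recon=1.0, w_text=1.0):
    """Weighted sum of both terms; the weights are hypothetical."""
    return (w_recon * photometric_loss(rendered, target, mask)
            + w_text * text_loss(token_probs))


# Toy example: three pixels, the last one lies outside the face mask.
rendered = [0.5, 0.2, 0.9]
target = [0.4, 0.2, 0.1]
mask = [1, 1, 0]
token_probs = [0.5, 0.25]  # model probability assigned to each correct token

loss = total_loss(rendered, target, mask, token_probs)
print(round(loss, 4))
```

Masking the reconstruction term to the face region reflects the fact that a 3DMM only explains face pixels; keeping the text term alongside it corresponds to the stated goal of preserving instruction following while the model learns 3DMM regression.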