FaceGPT: Self-supervised Learning to Chat about 3D Human Faces
Haoran Wang, Mohit Mendiratta, Christian Theobalt, Adam Kortylewski
TL;DR
FaceGPT introduces a self-supervised framework that endows a Large Vision-Language Model with 3D face reasoning by embedding 3DMM parameters $\theta=[\alpha, \delta, \gamma, \phi, c]$ into the token space through a <FACE> token and a differentiable renderer. It trains as a model-based autoencoder from in-the-wild images, using a self-supervised loss that enforces 3DMM regression via image reconstruction and a text-prediction objective to preserve instruction-following, all without 3D annotations. The approach supports both image-to-3DMM and text-to-3DMM generation, as well as multi-modal instruction following, enabling 3D face reasoning in a generalist multi-modal LLM (MM-LLM). Experiments show competitive monocular 3D face reconstruction against specialized methods and strong text-driven 3D face generation, illustrating a viable path to integrating detailed 3D world knowledge with broad language capabilities in a fully self-supervised regime.
Abstract
We introduce FaceGPT, a self-supervised learning framework for Large Vision-Language Models (VLMs) to reason about 3D human faces from images and text. Typical 3D face reconstruction methods are specialized algorithms that lack semantic reasoning capabilities. FaceGPT overcomes this limitation by embedding the parameters of a 3D morphable face model (3DMM) into the token space of a VLM, enabling the generation of 3D faces from both textual and visual inputs. FaceGPT is trained in a self-supervised manner as a model-based autoencoder from in-the-wild images. In particular, the hidden state of the LLM is projected into 3DMM parameters and subsequently rendered as a 2D face image to guide the self-supervised learning process via image-based reconstruction. Without relying on expensive 3D annotations of human faces, FaceGPT obtains a detailed understanding of 3D human faces while preserving the capacity to understand general user instructions. Our experiments demonstrate that FaceGPT not only achieves high-quality 3D face reconstructions but also retains the ability for general-purpose visual instruction following. Furthermore, FaceGPT learns, in a fully self-supervised manner, to generate 3D faces from complex textual inputs, which opens a new direction in human face analysis.
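The projection and reconstruction steps described above can be illustrated with a minimal sketch. It is not the authors' implementation: the group sizes in `GROUP_SIZES`, the function names, the L1 image term, and the loss weights are all hypothetical choices made for illustration. The differentiable renderer is left out, so rendered and target pixels are passed in as plain lists, and a real training setup would use an autodiff framework rather than pure Python.

```python
# Hypothetical sizes for the five 3DMM parameter groups named in the text;
# the source gives the group symbols but not their dimensions.
GROUP_SIZES = {"alpha": 80, "delta": 64, "gamma": 27, "phi": 3, "c": 3}


def project(hidden, weight, bias):
    """Linear map from a hidden state to a flat 3DMM parameter vector."""
    return [
        sum(w * h for w, h in zip(row, hidden)) + b
        for row, b in zip(weight, bias)
    ]


def split_params(theta):
    """Slice the flat vector into the named groups, in declaration order."""
    groups, start = {}, 0
    for name, size in GROUP_SIZES.items():
        groups[name] = theta[start:start + size]
        start += size
    return groups


def photometric_l1(rendered, target):
    """Mean absolute pixel difference (an assumed reconstruction term)."""
    return sum(abs(r - t) for r, t in zip(rendered, target)) / len(target)


def total_loss(rendered, target, text_nll, recon_weight=1.0, text_weight=1.0):
    """Weighted sum of the image term and a precomputed text-prediction loss."""
    return recon_weight * photometric_l1(rendered, target) + text_weight * text_nll


# Tiny usage example with a 2-dimensional hidden state and 2 output parameters.
theta_flat = project([1.0, 2.0], [[1.0, 0.0], [0.5, 0.5]], [0.0, 1.0])
loss = total_loss([0.0, 1.0], [1.0, 1.0], text_nll=2.0)
```

Keeping the text-prediction term in the same objective as the image term mirrors the stated goal of learning 3DMM regression while preserving instruction following; the equal default weights here are only a placeholder.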
