TEASER: Token Enhanced Spatial Modeling for Expressions Reconstruction

Yunfei Liu, Lei Zhu, Lijian Lin, Ye Zhu, Ailing Zhang, Yu Li

TL;DR

TEASER tackles single-image 3D facial expression reconstruction by introducing a multi-scale appearance tokenizer and a token-guided neural renderer that fuse implicit appearance tokens with explicit FLAME geometry. The method employs a token consistency objective and a pose-dependent landmark loss, yielding faithful, expressive 3D meshes together with high-fidelity, aligned 2D renderings. Across quantitative and qualitative benchmarks, TEASER achieves state-of-the-art performance and demonstrates strong token interpretability, enabling applications in expression transfer and identity swapping. The approach reduces reliance on imperfect photometric supervision and offers a versatile framework for photorealistic face editing and animation in the wild.

Abstract

3D facial reconstruction from a single in-the-wild image is a crucial task in human-centered computer vision. While existing methods can recover accurate facial shapes, there remains significant room for improvement in fine-grained expression capture. Current approaches struggle with irregular mouth shapes, exaggerated expressions, and asymmetrical facial movements. We present TEASER (Token EnhAnced Spatial modeling for Expressions Reconstruction), which addresses these challenges and enhances 3D facial geometry performance. TEASER tackles two main limitations of existing methods: the insufficiency of photometric loss for self-reconstruction and the inaccurate localization of subtle expressions. We introduce a multi-scale tokenizer to extract facial appearance information. Combined with a neural renderer, these tokens provide precise geometric guidance for expression reconstruction. Furthermore, TEASER incorporates a pose-dependent landmark loss to further improve geometric performance. Our approach not only significantly enhances expression reconstruction quality but also offers interpretable tokens suitable for various downstream applications, such as photorealistic facial video driving, expression transfer, and identity swapping. Quantitative and qualitative experimental results across multiple datasets demonstrate that TEASER achieves state-of-the-art performance in precise expression reconstruction.

Paper Structure

This paper contains 28 sections, 13 equations, 14 figures, 5 tables.

Figures (14)

  • Figure 1: Given the input image (a), TEASER predicts hybrid parameters for 3D facial reconstruction. The explicit parameters can be used to reconstruct precise 3D facial expressions (b). The implicit parameters (i.e., appearance tokens) guide high-fidelity face image generation (c). TEASER can be easily adapted to various applications, e.g., expression modification, as shown in the top row of (d), or changing facial appearance through token swapping, as shown in the bottom row of (d).
  • Figure 2: The framework of our pipeline.
  • Figure 3: Visual comparison of 3D face reconstruction with SOTA methods.
  • Figure 4: Visual comparison of estimated expression and its corresponding reconstructed images.
  • Figure 5: Visual results of ablation study. Left: impact of token consistency loss. Middle: impact of region loss. Right: impact of proposed landmark loss.
  • ...and 9 more figures