ExFace: Expressive Facial Control for Humanoid Robots with Diffusion Transformers and Bootstrap Training

Dong Zhang, Jingwei Peng, Yuyang Jiao, Jiayuan Gu, Jingyi Yu, Jiahao Chen

TL;DR

ExFace tackles the challenge of translating rich human facial expressions into precise bionic robot motor commands in real time. It uses a conditional Diffusion Transformer with a bootstrap training strategy to map 55-dimensional human blendshape sequences to 33-dimensional robot motor controls, enabling smooth, natural expressions over 120-frame windows. The approach reports higher accuracy and lower latency (approximately $0.15$ s) than previous methods while running at $60$ FPS. It is validated on two robots (Micheal and Hobbs) and complemented by the ExFace dataset and cross-platform results. This work advances human–robot interaction by enabling real-time, high-fidelity facial mimicry and supports digital-human interactions through Media2Face integration.
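
To put the quoted figures in perspective, the short Python calculation below converts them into per-frame terms. It is only a back-of-the-envelope sketch: the constants come from the TL;DR, while the assumption that the 120-frame window is sampled at the same 60 FPS rate is ours and is not stated in this summary.

```python
# Figures quoted in the TL;DR.
FPS = 60              # output rate in frames per second
LATENCY_S = 0.15      # approximate response time in seconds
WINDOW_FRAMES = 120   # length of one blendshape/motor window in frames

# Time budget available for each output frame.
frame_period_ms = 1000.0 / FPS

# Response time expressed as a number of frames at the output rate.
latency_frames = LATENCY_S * FPS

# Duration of one window, ASSUMING it is sampled at the same 60 FPS.
window_seconds = WINDOW_FRAMES / FPS

print(f"frame period: {frame_period_ms:.2f} ms")
print(f"latency: about {latency_frames:.1f} frames")
print(f"window duration (assumed 60 FPS sampling): {window_seconds:.1f} s")
```

Under these assumptions, the reported latency corresponds to roughly nine output frames, and one window spans about two seconds of motion.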

Abstract

This paper presents a novel Expressive Facial Control (ExFace) method based on Diffusion Transformers, which achieves precise mapping from human facial blendshapes to bionic robot motor control. By incorporating an innovative model bootstrap training strategy, our approach not only generates high-quality facial expressions but also significantly improves accuracy and smoothness. Experimental results demonstrate that the proposed method outperforms previous methods in terms of accuracy, frames per second (FPS), and response time. Furthermore, we develop the ExFace dataset driven by human facial data. ExFace shows excellent real-time performance and natural expression rendering in applications such as robot performances and human-robot interactions, offering a new solution for bionic robot interaction.

Paper Structure

This paper contains 12 sections, 4 equations, 6 figures, 1 table.

Figures (6)

  • Figure 1: Micheal, the realistic humanoid robot, demonstrates its interactive capabilities by displaying a variety of facial expressions, with these motions driven by ExFace. Its dynamic range of expressions highlights the potential for natural, human-like interaction between robots and people.
  • Figure 2: Data collection. ARKit [apple2025arkit] accurately captures blendshape [joshi2006learning] data from both human and robot faces, providing a reliable foundation for subsequent calibration and mapping.
  • Figure 3: Overview of ExFace. ExFace is a system that maps human facial blendshape data to robot motor control values using a Diffusion Transformer network. On the left side, we begin with initial training using random motor control values, followed by model bootstrap, where dynamic facial expression sequences are used to iteratively refine the model. On the right side, our Diffusion Transformer network architecture takes blendshape data ($c$) as the conditioning input. In each iteration, the network progressively denoises the data, and the final output, $x_0$, is the motor control values.
  • Figure 4: ExFace Model Bootstrap Curve. This figure illustrates the performance improvement achieved through our iterative training strategy. It also compares the results of using random data interpolation with dynamic facial expression sequences.
  • Figure 5: Real Person-Driven Performance on Different Robots. This figure demonstrates the generalizability of our approach by showcasing real human-driven facial expressions across various robotic platforms.
  • ...and 1 more figure
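
The conditional denoising loop described in the Figure 3 caption can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. It assumes a standard DDPM-style sampler with noise prediction, a linear noise schedule, and 50 steps, none of which are specified in this summary. The function `toy_denoiser` is a hypothetical placeholder standing in for the trained Diffusion Transformer. Only the window length (120) and the dimensions (55 blendshapes, 33 motors) come from the text.

```python
import numpy as np

WINDOW, N_BLEND, N_MOTOR = 120, 55, 33  # sizes quoted in the TL;DR
T = 50                                  # number of diffusion steps (assumed)

rng = np.random.default_rng(0)

# Hypothetical stand-in for the trained network: a fixed random projection
# from blendshape space to motor space. NOT the paper's model.
W = rng.normal(scale=0.1, size=(N_BLEND, N_MOTOR))


def toy_denoiser(x_t, t, c):
    """Placeholder noise estimate for x_t at step t, conditioned on c."""
    return x_t - c @ W


def sample_motor_sequence(c, betas, rng):
    """DDPM-style ancestral sampling of x_0 conditioned on blendshapes c."""
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    x = rng.normal(size=(c.shape[0], N_MOTOR))  # x_T: pure Gaussian noise
    for t in range(len(betas) - 1, -1, -1):
        eps_hat = toy_denoiser(x, t, c)
        # Posterior mean of x_{t-1} given the predicted noise.
        mean = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps_hat) / np.sqrt(alphas[t])
        # No noise is added at the final step, so the result is x_0.
        noise = rng.normal(size=x.shape) if t > 0 else 0.0
        x = mean + np.sqrt(betas[t]) * noise
    return x


betas = np.linspace(1e-4, 0.02, T)                  # linear schedule (assumed)
c = rng.uniform(0.0, 1.0, size=(WINDOW, N_BLEND))   # synthetic blendshape window
x0 = sample_motor_sequence(c, betas, rng)
print(x0.shape)  # (120, 33)
```

The sketch only demonstrates the data flow: a window of blendshape conditions $c$ goes in, noise is removed step by step, and a window of motor control values $x_0$ with matching length comes out. In a real system the placeholder would be replaced by the trained network and the output would be mapped to the robot's valid motor ranges.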