VividFace: Real-Time and Realistic Facial Expression Shadowing for Humanoid Robots

Peizhen Li, Longbing Cao, Xiao-Ming Wu, Yang Zhang

TL;DR

VividFace tackles the challenge of real-time, realistic facial expression shadowing for humanoid robots by integrating an optimized two-stage imitation framework (M1 for motion transfer and M2 for mapping to control values) with a video-streaming pipeline and domain-adaptive training. Fine-tuning the motion transfer module on humanoid data using GAN-based image reconstruction (forming X2CNet++) and introducing a feature-adaptation strategy for the mapping network bridge the gap between training and inference domains, enabling preservation of subtle details like wrinkles and gaze. The system achieves end-to-end latency of about $0.05$ s and outperforms baselines on realism metrics (AUR and MAID) across real-world demonstrations on Ameca, validated by ablation studies and latency measurements. These results demonstrate the practical utility of VividFace for natural, responsive human–robot interaction and highlight avenues for scaling to multi-person scenarios and more compact architectures.

Abstract

Humanoid facial expression shadowing enables robots to realistically imitate human facial expressions in real time, which is critical for lifelike, facially expressive humanoid robots and affective human-robot interaction. Existing progress in humanoid facial expression imitation remains limited, often failing to achieve either real-time performance or realistic expressiveness due to offline video-based inference designs and insufficient ability to capture and transfer subtle expression details. To address these limitations, we present VividFace, a real-time and realistic facial expression shadowing system for humanoid robots. An optimized imitation framework X2CNet++ enhances expressiveness by fine-tuning the human-to-humanoid facial motion transfer module and introducing a feature-adaptation training strategy for better alignment across different image sources. Real-time shadowing is further enabled by a video-stream-compatible inference pipeline and a streamlined workflow based on asynchronous I/O for efficient communication across devices. VividFace produces vivid humanoid faces by mimicking human facial expressions within 0.05 seconds, while generalizing across diverse facial configurations. Extensive real-world demonstrations validate its practical utility. Videos are available at: https://lipzh5.github.io/VividFace/.
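
The two-stage framework and the asynchronous streaming workflow described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `motion_transfer` and `mapping_network` are hypothetical stand-ins for $\mathcal{M}_1$ and $\mathcal{M}_2$, frames are plain lists of floats rather than images, and the "keep only the newest frame" queue policy is an assumption about how a stream-compatible pipeline might avoid building up a backlog.

```python
import asyncio
import time
from typing import List, Optional, Tuple

Frame = List[float]  # stand-in for an RGB frame I_d (assumption: not a real image)


def motion_transfer(frame: Frame) -> Frame:
    """Hypothetical stand-in for M1: maps a driving frame to an intermediate I_m."""
    return [0.5 * v for v in frame]


def mapping_network(intermediate: Frame) -> List[float]:
    """Hypothetical stand-in for M2: predicts control values, clipped to [0, 1]."""
    return [min(1.0, max(0.0, v)) for v in intermediate]


def imitate(frame: Frame) -> List[float]:
    """Compose the two stages: y_hat = M2(M1(I_d))."""
    return mapping_network(motion_transfer(frame))


def put_latest(queue: asyncio.Queue, item: Optional[Tuple[float, Frame]]) -> None:
    """Assumed policy: a single-slot queue where a new item replaces a stale one."""
    if queue.full():
        queue.get_nowait()
    queue.put_nowait(item)


async def camera(queue: asyncio.Queue, n_frames: int, period: float = 0.01) -> None:
    """Simulated camera: emits timestamped frames, then a None sentinel."""
    for i in range(n_frames):
        put_latest(queue, (time.perf_counter(), [float(i), float(i) / n_frames]))
        await asyncio.sleep(period)
    put_latest(queue, None)


async def shadow(queue: asyncio.Queue, results: list) -> None:
    """Consumes frames and records (latency, control values) per processed frame."""
    loop = asyncio.get_running_loop()
    while True:
        item = await queue.get()
        if item is None:
            break
        captured_at, frame = item
        # Run inference in a worker thread so frame I/O on the event loop is not blocked.
        controls = await loop.run_in_executor(None, imitate, frame)
        results.append((time.perf_counter() - captured_at, controls))


async def run(n_frames: int = 5) -> list:
    queue: asyncio.Queue = asyncio.Queue(maxsize=1)
    results: list = []
    await asyncio.gather(camera(queue, n_frames), shadow(queue, results))
    return results


if __name__ == "__main__":
    for latency, controls in asyncio.run(run()):
        print(f"latency={latency:.4f}s controls={controls}")
```

In a real deployment the camera coroutine would read from a network stream and the control values would be sent to the robot. The per-frame latency recorded here corresponds in spirit to the end-to-end figure the abstract reports, but the numbers this toy example prints say nothing about the actual system.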

Paper Structure

This paper contains 16 sections, 12 equations, 8 figures, 4 tables, 1 algorithm.

Figures (8)

  • Figure 1: A demonstration of VividFace. The humanoid robot faithfully imitates the facial expressions of the human performer in real time. The shadowing of subtle details, such as frowning, gaze direction, and head pose enhances realism.
  • Figure 2: An overview of the VividFace workflow. An RGB camera captures human facial expression dynamics (A), and the image frames (each frame denoted by $I_d$) are streamed to the server and processed by the imitation framework, which consists of the motion transfer module $\mathcal{M}_1$ and the mapping network $\mathcal{M}_2$. The motion transfer module produces an intermediate expression representation $I_m = \mathcal{M}_1(I_d; f_s, x_{c,s})$ that integrates human motion with a virtual robot face. The mapping network then predicts control values $\hat{\mathbf{y}} = \mathcal{M}_2(I_m)$, which are used to drive the physical robot to reproduce the expression (B). The intermediate data flow for three example frames is visualized on the right (C).
  • Figure 3: Comparison of nose wrinkle transfer in models with and without fine-tuning on the X2C dataset.
  • Figure 4: An illustration of the feature-adaptation training. $I_x$ refers to the image from the X2C dataset, while $\tilde{I}_x = \mathcal{M}_1(I_x)$ denotes its generated counterpart.
  • Figure 5: Real-world examples of humanoid robots performing realistic facial expression imitation.
  • ...and 3 more figures