VividFace: Real-Time and Realistic Facial Expression Shadowing for Humanoid Robots
Peizhen Li, Longbing Cao, Xiao-Ming Wu, Yang Zhang
TL;DR
VividFace addresses real-time, realistic facial expression shadowing for humanoid robots. It integrates an optimized two-stage imitation framework (M1 for motion transfer and M2 for mapping to control values) with a video-streaming pipeline and domain-adaptive training. Fine-tuning the motion transfer module on humanoid data using GAN-based image reconstruction (forming X2CNet++) and introducing a feature-adaptation strategy for the mapping network bridge the gap between the training and inference domains, which helps preserve subtle details such as wrinkles and gaze. The system achieves an end-to-end latency of about $0.05$ s and outperforms baselines on the realism metrics AUR and MAID in real-world demonstrations on Ameca; ablation studies and latency measurements support these findings. The results demonstrate the practical utility of VividFace for natural, responsive human–robot interaction and point to future work on multi-person scenarios and more compact architectures.
Abstract
Humanoid facial expression shadowing enables robots to realistically imitate human facial expressions in real time, which is critical for lifelike, facially expressive humanoid robots and affective human-robot interaction. Existing approaches to humanoid facial expression imitation remain limited: they often fail to achieve real-time performance or realistic expressiveness, owing to offline video-based inference designs and an insufficient ability to capture and transfer subtle expression details. To address these limitations, we present VividFace, a real-time and realistic facial expression shadowing system for humanoid robots. An optimized imitation framework, X2CNet++, enhances expressiveness by fine-tuning the human-to-humanoid facial motion transfer module and introducing a feature-adaptation training strategy for better alignment across different image sources. Real-time shadowing is further enabled by a video-stream-compatible inference pipeline and a streamlined workflow based on asynchronous I/O for efficient communication across devices. VividFace produces vivid humanoid faces by mimicking human facial expressions within 0.05 seconds, while generalizing across diverse facial configurations. Extensive real-world demonstrations validate its practical utility. Videos are available at: https://lipzh5.github.io/VividFace/.
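As a rough illustration of what a video-stream-compatible workflow built on asynchronous I/O could look like, the following minimal Python sketch processes frames one at a time and keeps only the newest frame when the consumer falls behind. It is not the authors' implementation: the function names (`motion_transfer`, `map_to_controls`, `capture`, `shadow`, `run`), the single-slot queue, and the frame-dropping policy are all assumptions made for illustration, and the two model stages are replaced by trivial stand-ins.

```python
import asyncio
import time

# Hypothetical stand-ins for the two stages named in the text (M1 and M2).
# Real models would run neural network inference here.
def motion_transfer(frame):
    """Assumed M1: transfer human facial motion to the humanoid domain."""
    return frame

def map_to_controls(features):
    """Assumed M2: map transferred features to robot control values."""
    return [float(x) for x in features]

async def capture(frames, queue, period_s=0.0):
    """Push frames into a single-slot queue, replacing any stale frame."""
    for frame in frames:
        if queue.full():
            queue.get_nowait()  # drop the frame the consumer never picked up
        queue.put_nowait((time.perf_counter(), frame))
        await asyncio.sleep(period_s)  # yield control, emulating the camera period
    await queue.put(None)  # sentinel: waits for a free slot, so the last frame is kept

async def shadow(queue, send):
    """Consume frames, run both stages, and send controls to the robot."""
    latencies = []
    while True:
        item = await queue.get()
        if item is None:
            break
        captured_at, frame = item
        controls = map_to_controls(motion_transfer(frame))
        await send(controls)  # assumed async transport to the robot
        latencies.append(time.perf_counter() - captured_at)
    return latencies

async def run(frames, send):
    """Run capture and shadowing concurrently; return per-frame latencies."""
    queue = asyncio.Queue(maxsize=1)
    producer = asyncio.create_task(capture(frames, queue))
    latencies = await shadow(queue, send)
    await producer
    return latencies
```

Under these assumptions, a caller would supply an async `send` coroutine that wraps the actual robot interface and could compare the returned latencies against a target budget such as the 0.05 s figure reported above. Dropping stale frames trades completeness for responsiveness, which is one plausible design choice for shadowing but is not confirmed by the text.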
