Driving Animatronic Robot Facial Expression From Speech

Boren Li; Hang Li; Hangxin Liu

TL;DR

This work addresses the problem of generating speech-synchronized, lifelike facial expressions on an animatronic robot. It introduces a skinning-centric framework that uses linear blend skinning (LBS) to unify embodiment design and motion synthesis, comprising an LBS-based actuation topology and a learning-based speech-to-blendshape pipeline. The system runs in real time at over 4000 fps on an NVIDIA RTX 4090, and millimeter-scale tracking is verified with VICON. The model is trained on VOCASET with a Wav2vec2-based speech encoder, and at inference an LBS decoder produces consistent, lip-synchronized expressions. A blind user study supports the perceptual naturalness of the generated motions, and the authors release their code to advance research in speech-driven animatronic facial expression generation.

Abstract

Animatronic robots hold the promise of enabling natural human-robot interaction through lifelike facial expressions. However, generating realistic, speech-synchronized robot expressions poses significant challenges due to the complexities of facial biomechanics and the need for responsive motion synthesis. This paper introduces a novel, skinning-centric approach to drive animatronic robot facial expressions from speech input. At its core, the proposed approach employs linear blend skinning (LBS) as a unifying representation, guiding innovations in both embodiment design and motion synthesis. LBS informs the actuation topology, facilitates human expression retargeting, and enables efficient speech-driven facial motion generation. This approach demonstrates the capability to produce highly realistic facial expressions on an animatronic face in real time at over 4000 fps on a single NVIDIA RTX 4090, significantly advancing robots' ability to replicate nuanced human expressions for natural interaction. To foster further research and development in this field, the code has been made publicly available at: https://github.com/library87/OpenRoboExp.
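
As a rough illustration of how blendshape coefficients can drive a skinned face mesh, the sketch below decodes coefficients into vertex positions by adding weighted per-vertex offsets to a neutral pose. This is an assumption about the general form of such a decoder, not the paper's exact LBS formulation, which may include further terms. The function name `decode_blendshapes` and the toy data are hypothetical.

```python
def decode_blendshapes(neutral, blendshapes, coeffs):
    """Additive blendshape decoding (illustrative sketch, not the paper's code).

    neutral:     list of (x, y, z) rest-pose vertices
    blendshapes: list of K offset lists, each with one (dx, dy, dz) per vertex
    coeffs:      list of K blendshape coefficients
    Returns the deformed vertices: neutral + sum_k coeffs[k] * blendshapes[k].
    """
    if len(blendshapes) != len(coeffs):
        raise ValueError("need exactly one coefficient per blendshape")

    deformed = [list(v) for v in neutral]
    for offsets, c in zip(blendshapes, coeffs):
        if len(offsets) != len(neutral):
            raise ValueError("each blendshape needs one offset per vertex")
        for i, (dx, dy, dz) in enumerate(offsets):
            deformed[i][0] += c * dx
            deformed[i][1] += c * dy
            deformed[i][2] += c * dz
    return [tuple(v) for v in deformed]


# Toy two-vertex "face" with two hypothetical blendshapes.
neutral = [(0, 0, 0), (1, 0, 0)]
jaw_open = [(0, -2, 0), (0, 0, 0)]
smile = [(0, 0, 0), (0.5, 0.5, 0)]
vertices = decode_blendshapes(neutral, [jaw_open, smile], [0.5, 1.0])
```

Because the mapping is linear in the coefficients, a decoder of this form is cheap to evaluate per frame, which is in keeping with the efficiency the abstract attributes to the LBS representation.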

Paper Structure (17 sections, 3 equations, 6 figures)

Introduction
Related Works
Proposed Approach
Approach Overview
LBS Representation
Skinning-oriented Robot Development
LBS-Oriented Kinematics Design
Electro-Mechanical Design and Development
Skinning Motion Imitation Learning
Model Architecture
Training
Inference
Implementation Details
Experiments
Robot Development Experiments
...and 2 more sections

Figures (6)

Figure 1: Dynamic animatronic robot facial expressions generated from speech. The figure shows the system's capability to produce diverse and lifelike facial expressions in real-time, synchronized with the corresponding audio speech input. The waveform at the top represents the audio input, while the series of images below showcase the robot's facial responses at different time points.
Figure 2: The proposed approach for creating a speech-driven animatronic robot face using LBS. The approach comprises three major components: (1) skinning-oriented robot development designs and constructs the animatronic face paired with a kinematics simulator based on the target skinning appearance, (2) skinning motion imitation learning involves training an LBS-based model from 3D human demonstrations to generate facial expressions from speech input, and (3) speech-driven robot orchestration generates animatronic facial expressions during inference by utilizing the developed platform, simulator, and learned model. The diagram highlights key development steps, outputs, and inference processes, demonstrating the end-to-end workflow from concept to final animatable robot face.
Figure 3: The proposed skinning-oriented robot design. The figure comprises two primary components: (1) LBS-oriented kinematics design, which showcases the facial mesh model with strategically placed control points for various facial features to achieve actuation topology for the facial muscular system that matches the designed LBS motion space and references facial anatomy, and (2) electro-mechanical design and development accounting for physical constraints of the embodiment, including key mechanical components of the skin, skeleton and muscular system, as well as the electrical control system. This comprehensive view demonstrates how the theoretical LBS model is translated into a functional, physically embodied animatronic face.
Figure 4: The proposed speech-driven facial skinning motion imitation learning method. The model architecture (blue section) comprises three key components: (1) a frame-level speech encoder that processes audio input and generates phoneme logits, (2) a speaking style encoder that captures individual speaking styles, and (3) an LBS encoder that generates blendshape coefficients. During training (red section), the model learns to imitate human facial skinning motions by minimizing the difference between generated and target expressions. In the inference branch (orange section), the trained model generates blendshape coefficients for the robot LBS decoder, producing robot-specific facial skinning motions as reference signals for the downstream kinematics simulator.
Figure 5: Motion space validation. Left: actuated blendshape error for different facial regions. Color-coded skinning landmarks on the 3D face model mark the facial regions used for evaluation. Error distributions between simulated and physically actuated blendshapes are visualized using violin plots, box plots, and scattered points. Each point represents a single blendshape, evaluated using region-specific landmarks. Median errors (in mm) are provided for each facial region, ranging from $1.76\mathrm{mm}$ (nose) to $8.63\mathrm{mm}$ (jaw). Right: qualitative comparison of eight simulated (gray mesh) and actuated (realistic skin) blendshapes. Blendshapes (1)-(6) demonstrate high accuracy, while (7) mouth close and (8) jaw open highlight limitations in the current design, exhibiting maximum errors for their respective regions.
...and 1 more figures
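
The per-region evaluation described in the Figure 5 caption can be sketched as follows. This is a minimal illustration under stated assumptions: each blendshape's error in a region is taken as the mean Euclidean distance over that region's landmarks, and the region score is the median over blendshapes. The caption does not state how landmark distances are aggregated, so the mean is an assumption. The function name, landmark ids, and coordinates are hypothetical, not measured data.

```python
import math
from statistics import mean, median


def region_median_errors(simulated, actuated, region_landmarks):
    """Median per-region error between simulated and actuated blendshapes.

    simulated, actuated: dict blendshape name -> dict landmark id -> (x, y, z) in mm
    region_landmarks:    dict region name -> list of landmark ids
    Returns dict region name -> median over blendshapes of the mean
    landmark distance in mm (the mean is an assumed aggregation).
    """
    result = {}
    for region, landmark_ids in region_landmarks.items():
        per_blendshape = []
        for name, sim_points in simulated.items():
            act_points = actuated[name]
            distances = [
                math.dist(sim_points[lm], act_points[lm]) for lm in landmark_ids
            ]
            per_blendshape.append(mean(distances))
        result[region] = median(per_blendshape)
    return result


# Hypothetical toy data: two blendshapes, one landmark per region.
simulated = {
    "jaw_open": {"chin": (0, 0, 0), "nose_tip": (0, 10, 0)},
    "smile": {"chin": (0, 0, 0), "nose_tip": (0, 10, 0)},
}
actuated = {
    "jaw_open": {"chin": (0, -3, 4), "nose_tip": (0, 10, 1)},
    "smile": {"chin": (0, 0, 1), "nose_tip": (0, 10, 0)},
}
regions = {"jaw": ["chin"], "nose": ["nose_tip"]}
errors = region_median_errors(simulated, actuated, regions)
```

In a real evaluation the actuated landmark positions would come from motion capture rather than hand-written tuples.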
