B2F: End-to-End Body-to-Face Motion Generation with Style Reference
Bokyung Jang, Eunho Jung, Yoonsang Lee
TL;DR
B2F generates facial motion that is semantically aligned with body motion while honoring a user-provided style reference. The method disentangles facial motion into content and discrete style factors, which it learns end to end with alignment- and consistency-based objectives and a Gumbel-Softmax-based style space. The architecture combines Transformer-based encoders for body content, facial content, and style with a Transformer decoder-based facial motion generator. It outputs facial motion in the FLAME format, and a dedicated converter maps that output to ARKit-style avatars for broader applicability. Perceptual studies and real-time benchmarks indicate that style-consistent, aligned facial expressions enhance perceived expressiveness and engagement, and that the system runs in real time across diverse characters. Limitations include missing eye-blinking dynamics; future work targets natural-language style control and broader avatar support.
Abstract
Human motion naturally integrates body movements and facial expressions, forming a unified perception. If a virtual character's facial expression does not align well with its body movements, it may weaken the perception of the character as a cohesive whole. Motivated by this, we propose B2F, a model that generates facial motions aligned with body movements. B2F takes a facial style reference as input, generating facial animations that reflect the provided style while maintaining consistency with the associated body motion. To achieve this, B2F learns a disentangled representation of content and style, using alignment and consistency-based objectives. We represent style using discrete latent codes learned via the Gumbel-Softmax trick, enabling diverse expression generation with a structured latent representation. B2F outputs facial motion in the FLAME format, making it compatible with SMPL-X characters, and supports ARKit-style avatars through a dedicated conversion module. Our evaluations show that B2F generates expressive and engaging facial animations that synchronize with body movements and style intent, while mitigating perceptual dissonance from mismatched cues, and generalizing across diverse characters and styles.
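For readers unfamiliar with the Gumbel-Softmax trick mentioned above, the following is a minimal, dependency-free sketch of how a relaxed one-hot sample over a set of discrete style codes can be drawn. It is an illustration of the general technique, not B2F's implementation: the function name `gumbel_softmax`, the four example logits, and the temperature value are assumptions made for this example, and the abstract does not state the codebook size or temperature actually used.

```python
import math
import random


def gumbel_softmax(logits, tau=1.0, hard=False, rng=random):
    """Draw one relaxed one-hot sample over len(logits) discrete codes.

    logits: unnormalized scores, one per style code (hypothetical here).
    tau:    temperature; smaller values push the sample toward one-hot.
    hard:   if True, return the exact one-hot vector of the arg-max.
    """
    noisy = []
    for logit in logits:
        # Clamp u strictly inside (0, 1) so both logarithms are finite.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        gumbel = -math.log(-math.log(u))  # Gumbel(0, 1) noise
        noisy.append((logit + gumbel) / tau)

    # Numerically stable softmax over the perturbed, scaled logits.
    peak = max(noisy)
    exps = [math.exp(x - peak) for x in noisy]
    total = sum(exps)
    soft = [e / total for e in exps]
    if not hard:
        return soft

    # Discretize to a one-hot vector. In an autodiff framework this step is
    # usually paired with a straight-through estimator so gradients still
    # flow through `soft`; plain Python has no gradients to preserve.
    k = max(range(len(soft)), key=soft.__getitem__)
    return [1.0 if i == k else 0.0 for i in range(len(soft))]


if __name__ == "__main__":
    random.seed(0)
    style_logits = [2.0, 0.5, -1.0, 0.0]  # assumed scores for 4 style codes
    print(gumbel_softmax(style_logits, tau=0.5))
    print(gumbel_softmax(style_logits, tau=0.5, hard=True))
```

Because the arg-max of the perturbed logits follows the categorical distribution given by the softmax of the original logits, repeated hard samples select each code in proportion to its probability, while the soft sample remains differentiable with respect to the logits when written in an autodiff framework. That property is what allows a discrete style space to be trained end to end.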
