B2F: End-to-End Body-to-Face Motion Generation with Style Reference

Bokyung Jang, Eunho Jung, Yoonsang Lee

TL;DR

B2F generates facial motion that is semantically aligned with body motion while following a user-provided style reference. The method decomposes facial motion into content and discrete style factors, learned end to end with alignment-driven objectives and a Gumbel-Softmax-based style space. The architecture combines Transformer-based encoders for body content, facial content, and style with a Transformer-decoder-based facial motion generator; it outputs motion in the FLAME format and includes an ARKit converter for broader applicability. Perceptual studies and real-time benchmarks indicate that style-consistent, body-aligned facial expressions increase perceived expressiveness and engagement, and that the system runs in real time across diverse characters. Limitations include the lack of eye-blinking dynamics; future work targets natural-language style control and broader avatar support.

Abstract

Human motion naturally integrates body movements and facial expressions, forming a unified perception. If a virtual character's facial expression does not align well with its body movements, it may weaken the perception of the character as a cohesive whole. Motivated by this, we propose B2F, a model that generates facial motions aligned with body movements. B2F takes a facial style reference as input, generating facial animations that reflect the provided style while maintaining consistency with the associated body motion. To achieve this, B2F learns a disentangled representation of content and style, using alignment and consistency-based objectives. We represent style using discrete latent codes learned via the Gumbel-Softmax trick, enabling diverse expression generation with a structured latent representation. B2F outputs facial motion in the FLAME format, making it compatible with SMPL-X characters, and supports ARKit-style avatars through a dedicated conversion module. Our evaluations show that B2F generates expressive and engaging facial animations that synchronize with body movements and style intent, while mitigating perceptual dissonance from mismatched cues, and generalizing across diverse characters and styles.
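
The abstract mentions discrete style codes learned via the Gumbel-Softmax trick. As a rough illustration of that trick in general (not the authors' implementation), the NumPy sketch below draws a relaxed one-hot sample from a set of logits. The function name `gumbel_softmax`, the temperature value, and the shape of the logits (4 style slots with 8 codes each) are assumptions made for this example only.

```python
import numpy as np

def gumbel_softmax(logits, temperature=1.0, hard=False, rng=None):
    """Draw a relaxed one-hot sample over the last axis of `logits`.

    Generic illustration of the Gumbel-Softmax trick; hypothetical helper,
    not taken from the B2F code base.
    """
    rng = np.random.default_rng() if rng is None else rng
    logits = np.asarray(logits, dtype=np.float64)
    # Gumbel(0, 1) noise via inverse transform sampling; the lower bound avoids log(0).
    u = rng.uniform(low=1e-10, high=1.0, size=logits.shape)
    gumbel = -np.log(-np.log(u))
    # Temperature-scaled softmax of the perturbed logits (max-shifted for stability).
    y = (logits + gumbel) / temperature
    y = np.exp(y - y.max(axis=-1, keepdims=True))
    y = y / y.sum(axis=-1, keepdims=True)
    if hard:
        # Discretize to an exact one-hot vector. In an autodiff framework this
        # is usually paired with a straight-through gradient estimator.
        one_hot = np.zeros_like(y)
        idx = np.expand_dims(y.argmax(axis=-1), -1)
        np.put_along_axis(one_hot, idx, 1.0, axis=-1)
        return one_hot
    return y

# Assumed sizes for illustration: 4 style slots, 8 codes per slot.
style_logits = np.zeros((4, 8))
soft_codes = gumbel_softmax(style_logits, temperature=0.5)
hard_codes = gumbel_softmax(style_logits, temperature=0.5, hard=True)
```

Lower temperatures push the soft sample closer to a one-hot vector, which is what lets a continuous relaxation stand in for a discrete code during training.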

Paper Structure

This paper contains 26 sections, 12 equations, 16 figures, 1 table.

Figures (16)

  • Figure 1: B2F architecture. The B2F model consists of a body content encoder $\mathcal{E}_b$, facial content encoder $\mathcal{E}_f$, facial style encoder $\mathcal{E}_s$, and facial motion generator $\mathcal{G}$. During training, it uses body and facial content motions along with a facial style reference to generate facial motion. At inference, B2F generates facial motion using only the body content motion and style reference.
  • Figure 2: Loss computation. $\mathcal{L}_{\text{recon}}$, $\mathcal{L}_{\text{align}}$, $\mathcal{L}_{\text{KL}}$ and $\mathcal{L}_{\text{consi}}$ are calculated using the content and style input extracted from the same whole-body motion segment, while $\mathcal{L}_{\text{cross}}$ is calculated using the content inputs from one motion segment and the style input from a different segment.
  • Figure 3: Extensions for various animated characters.
  • Figure 4: Results for various content and style inputs.
  • Figure 5: Results for various animated characters.
  • ...and 11 more figures
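
The Figure 2 caption distinguishes losses computed on content and style taken from the same motion segment from $\mathcal{L}_{\text{cross}}$, which pairs content from one segment with style from a different one. A minimal sketch of how such pairs could be indexed within a batch follows. The helper `make_training_pairs` and the shift-by-one pairing rule are assumptions for illustration; the source does not say how the second segment is chosen.

```python
import numpy as np

def make_training_pairs(num_segments):
    """Return (content_idx, style_idx) index arrays for both pairing modes.

    Hypothetical helper: the pairing rule below is an assumption, not the
    procedure described in the paper.
    """
    if num_segments < 2:
        raise ValueError("cross-segment pairing needs at least two segments")
    content_idx = np.arange(num_segments)
    # Same-segment pairs: content and style come from the same motion segment
    # (the setting used for the recon, align, KL, and consi losses in Figure 2).
    same_style_idx = content_idx.copy()
    # Cross-segment pairs: shifting by one guarantees that each style index
    # differs from its content index (the setting used for the cross loss).
    cross_style_idx = np.roll(content_idx, 1)
    return (content_idx, same_style_idx), (content_idx, cross_style_idx)

same_pairs, cross_pairs = make_training_pairs(4)
# same_pairs  -> content [0, 1, 2, 3], style [0, 1, 2, 3]
# cross_pairs -> content [0, 1, 2, 3], style [3, 0, 1, 2]
```

A random permutation with fixed points removed would serve the same purpose; the fixed shift is used here only because it is easy to verify.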