When Less Is More: A Sparse Facial Motion Structure For Listening Motion Learning

Tri Tung Nguyen Nguyen, Quang Tien Dam, Dinh Tuan Tran, Joo-Ho Lee

TL;DR

This work tackles the challenge of predicting listening head motion in dyadic interactions by moving beyond dense continuous-to-discrete representations. It introduces Sparse Facial Motion Structure (SFMS), which encodes long 3DMM facial motion as a sparse sequence of $k$ keyframes interleaved with transition frames and uses an inpainting Transformer to recover intermediate states, enabling stable, high-fidelity reconstructions. The approach is paired with a multimodal listening head predictor that fuses speaker visuals and speech with a listener context in a Transformer framework, trained with a joint loss and a two-optimizer strategy. Empirical results on Learning2Listen and REACT demonstrate improved reconstruction accuracy and listening-head prediction quality, supported by qualitative assessments and subjective evaluations, underscoring SFMS’s potential for more natural and efficient human-robot interaction. The method also suggests broader applicability to related tasks such as micro-expression analysis and face verification, since sparsity reduces information redundancy.

Abstract

Effective human behavior modeling is critical for successful human-robot interaction. Current state-of-the-art approaches for predicting listening head behavior during dyadic conversations employ continuous-to-discrete representations, in which a continuous facial motion sequence is converted into discrete latent tokens. However, non-verbal facial motion presents unique challenges owing to its temporal variance and multi-modal nature. State-of-the-art discrete motion token representations struggle to capture the underlying non-verbal facial patterns, which makes training the listening head inefficient and yields low-fidelity generated motion. This study proposes a novel method for representing and predicting non-verbal facial motion by encoding long sequences into a sparse sequence of keyframes and transition frames. By identifying crucial motion steps and interpolating intermediate frames, our method preserves the temporal structure of motion while enhancing instance-wise diversity during the learning process. Additionally, we apply this sparse representation to the task of listening head prediction, demonstrating its contribution to improving the explanation of facial motion patterns.
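
To make the idea of "keyframes plus interpolated transition frames" concrete, the following is a minimal sketch, not the authors' method. It assumes two stand-ins that are not in the paper: a hand-written saliency heuristic (`select_keyframes`, scoring frames by local change in velocity) in place of the learned keyframe scores, and plain linear interpolation (`interpolate_transitions`) in place of the inpainting Transformer. Both function names are hypothetical.

```python
from typing import List, Sequence

Frame = List[float]  # one frame of motion coefficients (e.g. 3DMM expression values)


def select_keyframes(motion: Sequence[Frame], k: int) -> List[int]:
    """Return k sorted frame indices: both endpoints plus the most salient interior frames.

    Assumption: saliency is the absolute second difference of the motion signal.
    The paper learns keyframe scores; this heuristic only illustrates the idea.
    """
    n = len(motion)
    if not 2 <= k <= n:
        raise ValueError("need 2 <= k <= len(motion)")

    def score(t: int) -> float:
        # Large values mean the motion changes direction or speed at frame t.
        return sum(
            abs(motion[t + 1][d] - 2.0 * motion[t][d] + motion[t - 1][d])
            for d in range(len(motion[t]))
        )

    interior = sorted(range(1, n - 1), key=score, reverse=True)[: k - 2]
    return sorted([0, n - 1] + interior)


def interpolate_transitions(motion: Sequence[Frame], keys: Sequence[int]) -> List[Frame]:
    """Rebuild every frame from the keyframes alone.

    Assumption: linear interpolation stands in for the learned inpainting model.
    """
    recon: List[Frame] = [[] for _ in motion]
    for a, b in zip(keys, keys[1:]):
        for t in range(a, b + 1):
            w = (t - a) / (b - a)  # 0 at keyframe a, 1 at keyframe b
            recon[t] = [(1.0 - w) * x + w * y for x, y in zip(motion[a], motion[b])]
    return recon


if __name__ == "__main__":
    # A one-dimensional "expression" that rises and then falls.
    motion = [[0.0], [1.0], [2.0], [1.0], [0.0]]
    keys = select_keyframes(motion, k=3)
    print(keys)  # [0, 2, 4]
    print(interpolate_transitions(motion, keys))
```

In this toy case three keyframes are enough to recover all five frames, which is the intuition behind storing only the key states and letting a model fill in the transitions.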

Paper Structure

This paper contains 38 sections, 14 equations, 11 figures, 4 tables, 1 algorithm.

Figures (11)

  • Figure 1: Listening head prediction with sparse token overview. Our sparse representation captures key time steps from the listener's facial motion sequence, encoding temporal scale-varied and compound non-verbal facial motions, which generalize more effectively to the listening prediction task. The predictor determines the target facial motion to produce for future time steps and coordinates the transition of incoming frames toward specified facial expressions. By modeling both transition and key motion states as discrete tokens, this approach combines the robustness, stability, and flexibility of both discrete and continuous generative modeling.
  • Figure 2: Training Pipeline Overview. Our model learns to represent continuous motion as discrete tokens of key and transition frames with enhanced accuracy and fidelity. The proposal comprises two phases: reconstruction (top) and listening head motion prediction (bottom). The reconstruction phase includes two sub-modules: the expression token learning model and the motion inpainting model. The expression token learning model encodes a facial motion sequence into a finite set of discrete tokens, while the motion inpainting model fills the gaps between these tokens with intermediate states. The prediction phase uses the trained reconstruction module to predict future facial tokens in a next-token prediction task, where the model must decide whether to react with a transition state or with a key state that interrupts the current motion.
  • Figure 3: Keyframe score learning and the reconstruction task. The workflow starts by encoding 3DMM facial motion features into ranking logit scores to identify keyframes. Masks are sampled using the top-$k$ and Gumbel-Softmax functions for motion reconstruction. Keyframes are represented as vector quantized tokens, while transition frames are encoded positionally with keyframe information. The decoder reconstructs the original facial motion by combining keyframe and transition frame embeddings. Finally, the best reconstruction from the samples is used as the target for keyframe feature learning.
  • Figure 4: Dense and Sparse Facial Motion Token Comparison. In contrast to the dense structure, where 3DMM expression codes are uniformly encoded, our sparse structure selects keyframes and their positions to reconstruct locally dependent transition frames. This approach captures macro-level expression details while reducing computational complexity and minimizing information loss during the continuous-to-discrete conversion between keyframes.
  • Figure 5: Predictor's architecture overview. A transformer-based predictor is employed for the next-token prediction task based on multi-modal input context in a dyadic conversation.
  • ...and 6 more figures
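
The Figure 3 caption mentions sampling keyframe masks from ranking logits with top-$k$ and Gumbel noise, then keeping the best reconstruction among the samples. The sketch below illustrates only the hard-sampling half of that idea, using the standard Gumbel top-k trick (perturb each logit with Gumbel noise, keep the k largest). The function names, the `temperature` parameter, and the caller-supplied `error_fn` are assumptions for illustration. The differentiable Gumbel-Softmax relaxation and the actual reconstruction loss are not shown here.

```python
import math
import random
from typing import Callable, List, Sequence


def sample_keyframe_mask(
    logits: Sequence[float], k: int, rng: random.Random, temperature: float = 1.0
) -> List[int]:
    """Sample a binary mask with exactly k ones via Gumbel-perturbed top-k.

    With temperature 0 this reduces to deterministic top-k on the logits.
    """
    if not 0 < k <= len(logits):
        raise ValueError("need 0 < k <= len(logits)")
    perturbed = []
    for s in logits:
        # Clamp u away from 0 and 1 so both logarithms stay finite.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        gumbel = -math.log(-math.log(u))
        perturbed.append(s + temperature * gumbel)
    top = sorted(range(len(logits)), key=lambda i: perturbed[i], reverse=True)[:k]
    mask = [0] * len(logits)
    for i in top:
        mask[i] = 1
    return mask


def best_of_samples(
    logits: Sequence[float],
    k: int,
    error_fn: Callable[[List[int]], float],
    num_samples: int,
    rng: random.Random,
) -> List[int]:
    """Draw several masks and keep the one with the lowest error.

    error_fn is a hypothetical stand-in for a reconstruction error computed
    by decoding the motion from the frames selected by the mask.
    """
    masks = [sample_keyframe_mask(logits, k, rng) for _ in range(num_samples)]
    return min(masks, key=error_fn)


if __name__ == "__main__":
    rng = random.Random(0)
    logits = [0.1, 2.0, -1.0, 3.0, 0.5]
    print(sample_keyframe_mask(logits, k=2, rng=rng))
    # Toy error: prefer masks whose selected frames are spread far apart.
    spread = lambda m: -(max(i for i, v in enumerate(m) if v) - m.index(1))
    print(best_of_samples(logits, k=2, error_fn=spread, num_samples=8, rng=rng))
```

Sampling several masks instead of always taking the top-scoring frames lets training explore alternative keyframe placements, and the best sample can then serve as a target for the scoring network, as the caption describes.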