When Less Is More: A Sparse Facial Motion Structure For Listening Motion Learning
Tri Tung Nguyen Nguyen, Quang Tien Dam, Dinh Tuan Tran, Joo-Ho Lee
TL;DR
This work tackles the challenge of predicting listening-head motion in dyadic interactions by moving beyond dense continuous-to-discrete representations. It introduces the Sparse Facial Motion Structure (SFMS), which encodes a long 3DMM facial motion sequence as a sparse sequence of $k$ keyframes interleaved with transition frames, and uses an inpainting Transformer to recover the intermediate states, enabling stable, high-fidelity reconstructions. SFMS is paired with a multimodal listening-head predictor, a Transformer that fuses the speaker's visual and speech signals with the listener's context and is trained with a joint loss and a two-optimizer strategy. Results on Learning2Listen and REACT show improved reconstruction accuracy and listening-head prediction quality, supported by qualitative and subjective evaluations, underscoring SFMS’s potential for more natural and efficient human-robot interaction. Because sparsity reduces information redundancy, the representation may also benefit related tasks such as micro-expression analysis and face verification.
Abstract
Effective human behavior modeling is critical for successful human-robot interaction. Current state-of-the-art approaches for predicting listening head behavior during dyadic conversations employ continuous-to-discrete representations, in which a continuous facial motion sequence is converted into discrete latent tokens. However, non-verbal facial motion presents unique challenges owing to its temporal variance and multi-modal nature. State-of-the-art discrete motion token representations struggle to capture the underlying non-verbal facial patterns, which makes training the listening head inefficient and yields low-fidelity generated motion. This study proposes a novel method for representing and predicting non-verbal facial motion by encoding long sequences into a sparse sequence of keyframes and transition frames. By identifying crucial motion steps and interpolating intermediate frames, our method preserves the temporal structure of motion while enhancing instance-wise diversity during the learning process. Additionally, we apply this novel sparse representation to the task of listening head prediction, demonstrating its contribution to improving the explanation of facial motion patterns.
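The idea of keeping only crucial motion steps and filling in the frames between them can be illustrated with a toy sketch. The code below is not the paper's method: it assumes a simple second-difference heuristic for choosing keyframes and uses linear interpolation as a stand-in for the inpainting Transformer. The names `select_keyframes` and `interpolate`, and the list-of-floats frame layout, are hypothetical.

```python
from typing import List, Sequence

Frame = List[float]  # one frame of motion coefficients (hypothetical layout)


def select_keyframes(motion: Sequence[Frame], k: int) -> List[int]:
    """Return k sorted frame indices: the first frame, the last frame, and the
    k - 2 interior frames with the largest second difference (turning points).
    This criterion is an assumption made for illustration only."""
    n = len(motion)
    if k < 2 or k > n:
        raise ValueError("k must satisfy 2 <= k <= len(motion)")

    def curvature(t: int) -> float:
        # L1 norm of the discrete second difference at interior frame t.
        return sum(
            abs(nxt - 2.0 * cur + prev)
            for prev, cur, nxt in zip(motion[t - 1], motion[t], motion[t + 1])
        )

    interior = sorted(range(1, n - 1), key=curvature, reverse=True)[: k - 2]
    return sorted([0, n - 1] + interior)


def interpolate(motion: Sequence[Frame], keys: Sequence[int]) -> List[Frame]:
    """Rebuild the full sequence from the keyframes alone, filling every
    transition frame by linear interpolation between its two neighbours."""
    out: List[Frame] = [[] for _ in motion]
    for left, right in zip(keys, keys[1:]):
        span = right - left
        for t in range(left, right + 1):
            w = (t - left) / span  # 0 at the left keyframe, 1 at the right
            out[t] = [(1.0 - w) * a + w * b
                      for a, b in zip(motion[left], motion[right])]
    return out


if __name__ == "__main__":
    # Toy one-dimensional "motion": rise to a peak, then fall back.
    toy = [[0.0], [1.0], [2.0], [3.0], [2.0], [1.0], [0.0]]
    keys = select_keyframes(toy, k=3)
    print(keys, interpolate(toy, keys))
```

On this piecewise-linear toy signal, three keyframes (the two endpoints and the peak) are enough to recover every intermediate frame up to floating-point error. Real facial motion is not piecewise linear, which is why a learned inpainting model would be preferred over plain interpolation.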
