REWIND: Real-Time Egocentric Whole-Body Motion Diffusion with Exemplar-Based Identity Conditioning
Jihyun Lee, Weipeng Xu, Alexander Richard, Shih-En Wei, Shunsuke Saito, Shaojie Bai, Te-Li Wang, Minhyuk Sung, Tae-Kyun Kim, Jason Saragih
TL;DR
REWIND tackles real-time egocentric whole-body motion estimation by combining cascaded body-hand diffusion with a causal relative-temporal Transformer and diffusion distillation, which together enable single-step inference. It further improves realism through exemplar-based identity conditioning: a small set of example poses of the target identity is encoded by a shared network and integrated into the model via AdaIN. The approach achieves state-of-the-art results on real and synthetic egocentric datasets, runs in real time (over 30 FPS), and generalizes robustly to unseen motion lengths, outperforming prior baselines such as EgoWholeMocap and EgoPoseFormer. The work has practical implications for driving photorealistic avatars and for VR/AR applications, and it acknowledges occasional self-penetration artifacts as a topic for future improvement.
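To make the exemplar-based conditioning idea concrete, the following is a minimal NumPy sketch of AdaIN-style modulation, not the authors' implementation. The single-layer encoder, the mean pooling over exemplars, the choice to normalize each channel over time, and all names (`encode_exemplars`, `adain`, `W_enc`, `W_gamma`, `W_beta`) are assumptions made for illustration only.

```python
import numpy as np

def encode_exemplars(exemplar_poses, W_enc):
    """Encode K exemplar poses with one shared encoder and pool them.

    exemplar_poses: (K, P) array, one pose vector per exemplar.
    W_enc: (P, D) weights of a hypothetical single-layer shared encoder.
    Mean pooling is an assumption; it makes the code order-invariant.
    """
    feats = np.tanh(exemplar_poses @ W_enc)  # (K, D), same weights for every exemplar
    return feats.mean(axis=0)                # (D,) identity code

def adain(x, identity_code, W_gamma, W_beta, eps=1e-5):
    """AdaIN: normalize features, then scale and shift them using the identity code.

    x: (T, C) motion features for T frames.
    W_gamma, W_beta: (D, C) hypothetical projections of the identity code.
    Statistics are computed per channel over the time axis (an assumption).
    """
    mu = x.mean(axis=0, keepdims=True)
    sigma = x.std(axis=0, keepdims=True)
    x_norm = (x - mu) / (sigma + eps)
    gamma = 1.0 + identity_code @ W_gamma  # (C,) scale, centered at 1
    beta = identity_code @ W_beta          # (C,) shift
    return gamma * x_norm + beta

# Usage with random placeholder data (shapes chosen arbitrarily).
rng = np.random.default_rng(0)
poses = rng.normal(size=(5, 12))      # 5 exemplar poses, 12 values each
features = rng.normal(size=(20, 16))  # 20 frames, 16 feature channels
code = encode_exemplars(poses, rng.normal(size=(12, 8)))
out = adain(features, code,
            0.1 * rng.normal(size=(8, 16)),
            0.1 * rng.normal(size=(8, 16)))
print(out.shape)
```

In this sketch the identity only affects the per-channel scale and shift of the normalized features, which is the defining property of AdaIN; how and where REWIND applies the modulation inside its network is not specified in this summary.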
Abstract
We present REWIND (Real-Time Egocentric Whole-Body Motion Diffusion), a one-step diffusion model for real-time, high-fidelity human motion estimation from egocentric image inputs. While an existing method for egocentric whole-body (i.e., body and hands) motion estimation is non-real-time and acausal because it relies on iterative diffusion-based motion refinement to capture correlations between body and hand poses, REWIND operates in a fully causal and real-time manner. To enable real-time inference, we introduce (1) cascaded body-hand denoising diffusion, which effectively models the correlation between egocentric body and hand motions in a fast, feed-forward manner, and (2) diffusion distillation, which enables high-quality motion estimation with a single denoising step. Our denoising diffusion model is based on a modified Transformer architecture, designed to causally model output motions while enhancing generalizability to unseen motion lengths. Additionally, REWIND optionally supports identity-conditioned motion estimation when an identity prior is available. To this end, we propose a novel identity conditioning method based on a small set of pose exemplars of the target identity, which further enhances motion estimation quality. Through extensive experiments, we demonstrate that REWIND significantly outperforms existing baselines both with and without exemplar-based identity conditioning.
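As an illustration of what causal modeling with relative temporal information can look like, here is a minimal single-head attention sketch in NumPy. It is not the paper's architecture: the learned-bias formulation, the clipping of large offsets, and the names `causal_relative_attention` and `rel_bias` are assumptions chosen only to show the two properties named above (no access to future frames, and no dependence on absolute frame index).

```python
import numpy as np

def causal_relative_attention(q, k, v, rel_bias):
    """Single-head attention where frame i attends only to frames j <= i.

    q, k, v: (T, d) per-frame queries, keys, and values.
    rel_bias: (R,) hypothetical bias indexed by the temporal offset i - j.
    Offsets beyond R - 1 reuse the last entry (an assumption), so the same
    parameters can be applied to sequences longer than R frames.
    """
    T, d = q.shape
    scores = q @ k.T / np.sqrt(d)                    # (T, T) scaled dot products
    idx = np.arange(T)
    offset = idx[:, None] - idx[None, :]             # i - j, non-negative for past frames
    scores = scores + rel_bias[np.clip(offset, 0, len(rel_bias) - 1)]
    scores = np.where(offset >= 0, scores, -np.inf)  # mask out future frames
    scores = scores - scores.max(axis=1, keepdims=True)  # numerically stable softmax
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)
    return weights @ v

# Usage with random placeholder data: 12 frames, but only 6 bias entries.
rng = np.random.default_rng(0)
q, k, v = (rng.normal(size=(12, 4)) for _ in range(3))
out = causal_relative_attention(q, k, v, rel_bias=rng.normal(size=6))
print(out.shape)
```

Because the mask removes every future frame and the bias depends only on the offset between frames, the output for a given frame is unchanged when later frames are modified or appended, which is the behavior a streaming, real-time estimator needs.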
