REWIND: Real-Time Egocentric Whole-Body Motion Diffusion with Exemplar-Based Identity Conditioning

Jihyun Lee, Weipeng Xu, Alexander Richard, Shih-En Wei, Shunsuke Saito, Shaojie Bai, Te-Li Wang, Minhyuk Sung, Tae-Kyun Kim, Jason Saragih

TL;DR

REWIND tackles real-time egocentric whole-body motion estimation by combining cascaded body-hand diffusion with a causal relative-temporal Transformer and diffusion distillation to enable single-step inference. It further enhances realism through exemplar-based identity conditioning, which leverages a small set of example poses of the target identity encoded via a shared network and AdaIN integration. The approach achieves state-of-the-art results on real and synthetic egocentric datasets while delivering real-time performance (over 30 FPS) and robust generalization to unseen motion lengths, outperforming prior baselines such as EgoWholeMocap and EgoPoseFormer. The work offers practical implications for driving photorealistic avatars and VR/AR applications, while acknowledging occasional self-penetration issues as a topic for future improvement.
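As an illustration of the AdaIN integration mentioned above, here is a minimal sketch assuming the standard adaptive instance normalization form: each feature channel is normalized to zero mean and unit variance, then rescaled and shifted by per-channel parameters. In REWIND those parameters would presumably be derived from the exemplar-based identity embedding; how that embedding is mapped to scales and shifts is not stated here, so `scales` and `shifts` are passed in directly, and all names are hypothetical.

```python
import math


def adain(features, scales, shifts, eps=1e-5):
    """Adaptive instance normalization over a list of channels.

    features: list of channels, each a list of floats (e.g. values over time).
    scales, shifts: one float per channel; assumed to come from an identity
    embedding (hypothetical, not taken from the paper).
    """
    out = []
    for channel, gamma, beta in zip(features, scales, shifts):
        mean = sum(channel) / len(channel)
        var = sum((x - mean) ** 2 for x in channel) / len(channel)
        inv_std = 1.0 / math.sqrt(var + eps)
        # Normalize the channel, then apply the identity-dependent modulation.
        out.append([gamma * (x - mean) * inv_std + beta for x in channel])
    return out


# One channel with four values, modulated to mean ~3 and std ~2.
modulated = adain([[1.0, 2.0, 3.0, 4.0]], scales=[2.0], shifts=[3.0])
```

After modulation, each channel's statistics are set by the conditioning parameters rather than by the input, which is what lets an identity signal steer the motion features.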

Abstract

We present REWIND (Real-Time Egocentric Whole-Body Motion Diffusion), a one-step diffusion model for real-time, high-fidelity human motion estimation from egocentric image inputs. While an existing method for egocentric whole-body (i.e., body and hands) motion estimation is non-real-time and acausal due to diffusion-based iterative motion refinement to capture correlations between body and hand poses, REWIND operates in a fully causal and real-time manner. To enable real-time inference, we introduce (1) cascaded body-hand denoising diffusion, which effectively models the correlation between egocentric body and hand motions in a fast, feed-forward manner, and (2) diffusion distillation, which enables high-quality motion estimation with a single denoising step. Our denoising diffusion model is based on a modified Transformer architecture, designed to causally model output motions while enhancing generalizability to unseen motion lengths. Additionally, REWIND optionally supports identity-conditioned motion estimation when an identity prior is available. To this end, we propose a novel identity conditioning method based on a small set of pose exemplars of the target identity, which further enhances motion estimation quality. Through extensive experiments, we demonstrate that REWIND significantly outperforms the existing baselines both with and without exemplar-based identity conditioning.

Paper Structure

This paper contains 30 sections, 10 equations, 3 figures, 3 tables.

Figures (3)

  • Figure 1: (a) Pipeline overview. Given a sequence of stereo egocentric images and camera poses, our diffusion model first estimates 3D body motion and then estimates 3D hand motion conditioned on the 3D upper body motion. Our motion estimation can be optionally conditioned on the exemplar-based identity prior when available (Sec. \ref{subsec:personalized_motion}). Through an optional inverse kinematics step (refer to the supplementary for details), our tracking results can be used to drive meshes or photorealistic avatars. (b) Attention comparisons. Compared to vanilla self-attention (i.e., acausal, global attention) commonly used in existing works, the proposed causal windowed attention conditioned on relative timesteps enhances generalization to unseen motion lengths (Sec. \ref{subsubsec:network_relative_temporal_transformer}).
  • Figure 2: Qualitative comparisons on the ColossusEgo dataset. While our framework estimates 3D keypoints, we also employ inverse kinematics with per-identity meshes for more effective visual comparisons (refer to the supplementary material for details). Our method estimates significantly more accurate and natural motions compared to the existing state-of-the-art methods \cite{yang2024egoposeformer, wang2024egocentric}. The additional exemplar-based identity prior further enhances motion accuracy.
  • Figure 3: Qualitative comparisons on the UnrealEgo dataset \cite{akada2022unrealego, akada20243d}. Red represents the ground truth skeleton, while blue represents the predicted skeleton. Our method estimates more accurate motions compared to the existing baselines \cite{wang2024egocentric, yang2024egoposeformer}.
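
The attention comparison in Figure 1(b) can be made concrete with a small sketch. It assumes a fixed window of past frames and integer relative offsets; the window size and the way relative timesteps are encoded in REWIND are not given in this summary, so `window` and both function names are hypothetical. The point it illustrates is that each query frame sees only itself and a bounded number of preceding frames, indexed by offset rather than absolute position, so the pattern is identical for any sequence length.

```python
def causal_window_mask(num_frames, window):
    """mask[i][j] is True when query frame i may attend to key frame j.

    Frame i sees only itself and the (window - 1) frames before it,
    never a future frame.
    """
    return [
        [0 <= i - j < window for j in range(num_frames)]
        for i in range(num_frames)
    ]


def relative_offsets(num_frames, window):
    """For each query frame i, the relative timesteps (j - i) of visible keys."""
    return [
        [j - i for j in range(max(0, i - window + 1), i + 1)]
        for i in range(num_frames)
    ]


mask = causal_window_mask(num_frames=5, window=3)
short = relative_offsets(num_frames=5, window=3)
long = relative_offsets(num_frames=100, window=3)
```

Because the offsets seen by the last frame of a 5-frame clip and a 100-frame clip are the same, a model conditioned on them never encounters an out-of-range position at test time. Vanilla global attention with absolute positions does not have this property.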