Virtual avatar generation models as world navigators
Sai Mandava
TL;DR
SABR-CLIMB introduces a video-conditioned diffusion transformer that learns to generate complete 3D avatar motion from environment videos, demonstrated on rock climbing. The approach pairs a frozen DINOv2 video encoder with a DiT-based diffusion backbone, trained on the NAV-22M dataset, to produce per-frame SMPL pose, shape, and camera parameters. Evaluations combine qualitative analyses of spatial, temporal, and depth understanding with quantitative metrics on trajectory and movement adherence. Scaling studies examine the effects of data and model size, and the paper acknowledges limitations in out-of-distribution scenarios and in computational demands. The work points to potential applications in robotics, sports, and healthcare by enabling virtual avatars to learn and navigate complex real-world tasks from video data.
Abstract
We introduce SABR-CLIMB, a novel video model that simulates human movement in rock climbing environments using a virtual avatar. Our diffusion transformer predicts the clean sample rather than the noise at each diffusion step, and it ingests entire videos to output complete motion sequences. By leveraging a large proprietary dataset, NAV-22M, and substantial computational resources, we present a proof of concept for a system that trains general-purpose virtual avatars for complex tasks in robotics, sports, and healthcare.
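The difference between predicting the sample and predicting the noise can be illustrated with a minimal sketch. It assumes the standard DDPM forward process x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps and a mean-squared-error loss; neither the noise schedule nor the loss used by SABR-CLIMB is stated in this section, so both are assumptions. The function names, the 16-frame length, and the 85 values per frame are hypothetical and chosen only for illustration.

```python
import math
import random


def add_noise(x0, eps, alpha_bar_t):
    """Forward diffusion (assumed DDPM form): blend clean values with noise."""
    a = math.sqrt(alpha_bar_t)
    b = math.sqrt(1.0 - alpha_bar_t)
    return [a * x + b * e for x, e in zip(x0, eps)]


def mse(pred, target):
    """Mean squared error between two equal-length lists."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)


def x0_from_eps(x_t, eps, alpha_bar_t):
    """Recover the clean sample implied by a noise estimate."""
    a = math.sqrt(alpha_bar_t)
    b = math.sqrt(1.0 - alpha_bar_t)
    return [(x - b * e) / a for x, e in zip(x_t, eps)]


random.seed(0)

# Hypothetical flattened motion sequence: 16 frames x 85 values per frame.
NUM_VALUES = 16 * 85
x0 = [random.gauss(0.0, 1.0) for _ in range(NUM_VALUES)]
eps = [random.gauss(0.0, 1.0) for _ in range(NUM_VALUES)]
alpha_bar_t = 0.5  # assumed cumulative signal level at some step t

x_t = add_noise(x0, eps, alpha_bar_t)

# Sample prediction: the network output is compared with the clean motion x0.
# Here a "perfect" prediction is derived from the true noise as a stand-in.
perfect_x0 = x0_from_eps(x_t, eps, alpha_bar_t)
sample_loss = mse(perfect_x0, x0)

# Noise prediction: the network output would be compared with eps instead.
noise_loss = mse(eps, eps)
```

Under these assumptions the two targets carry the same information, since either can be converted to the other given x_t and alpha_bar_t. What changes is the quantity the network regresses: with sample prediction, the output at every step is directly a full motion sequence.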
