Virtual avatar generation models as world navigators

Sai Mandava

TL;DR

SABR-CLIMB introduces a video-conditioned diffusion transformer that learns to generate complete 3D avatar motion from environment videos, demonstrated in rock climbing. The approach uses a frozen DINOv2 video encoder and a DiT-based diffusion backbone, trained on the NAV-22M dataset, to produce per-frame SMPL pose, shape, and camera parameters. Evaluations combine qualitative analyses of spatial, temporal, and depth understanding with quantitative metrics on trajectory and movement adherence. Scaling studies characterize the effects of data and model size, and the authors acknowledge limitations in out-of-distribution scenarios and in computational demands. The work points to potential applications in robotics, sports, and healthcare, where virtual avatars could learn to navigate complex real-world tasks from video data.

Abstract

We introduce SABR-CLIMB, a novel video model simulating human movement in rock climbing environments using a virtual avatar. Our diffusion transformer predicts the sample instead of noise in each diffusion step and ingests entire videos to output complete motion sequences. By leveraging a large proprietary dataset, NAV-22M, and substantial computational resources, we showcase a proof of concept for a system to train general-purpose virtual avatars for complex tasks in robotics, sports, and healthcare.
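The phrase "predicts the sample instead of noise" refers to the choice of regression target in a diffusion model. The sketch below illustrates that choice on a toy vector; it is not the SABR-CLIMB implementation. The cosine noise schedule, the 8-dimensional stand-in for one frame's motion parameters, and all function names are assumptions made for illustration.

```python
import math
import random


def cosine_alpha_bar(t, T, s=0.008):
    """Cumulative signal level alpha_bar_t under a cosine schedule (assumed, not from the paper)."""
    def f(u):
        return math.cos((u / T + s) / (1 + s) * math.pi / 2) ** 2
    return f(t) / f(0)


def add_noise(x0, eps, alpha_bar):
    """Forward process: x_t = sqrt(alpha_bar) * x0 + sqrt(1 - alpha_bar) * eps."""
    a, b = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [a * x + b * e for x, e in zip(x0, eps)]


def eps_from_x0(x_t, x0_pred, alpha_bar):
    """Noise implied by a sample prediction, obtained by inverting the forward process."""
    a, b = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [(xt - a * x0) / b for xt, x0 in zip(x_t, x0_pred)]


def sample_prediction_loss(x0_pred, x0):
    """Mean squared error against the clean sample, the target under sample prediction."""
    return sum((p - x) ** 2 for p, x in zip(x0_pred, x0)) / len(x0)


rng = random.Random(0)
x0 = [rng.uniform(-1.0, 1.0) for _ in range(8)]  # hypothetical clean motion vector
eps = [rng.gauss(0.0, 1.0) for _ in range(8)]    # Gaussian noise
alpha_bar = cosine_alpha_bar(t=500, T=1000)
x_t = add_noise(x0, eps, alpha_bar)

# A perfect sample predictor returns x0 itself: the loss is zero and the
# noise it implies matches the noise that was actually added.
perfect_loss = sample_prediction_loss(x0, x0)
implied_eps = eps_from_x0(x_t, x0, alpha_bar)
```

The two parameterizations carry the same information at a given timestep, since either the sample or the noise can be recovered from the other together with x_t. They differ in how the loss weights timesteps and in what the network must output directly.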

Paper Structure (46 sections, 2 equations, 15 figures)

Introduction
Related Work
Human Motion Generation
Denoising diffusion probabilistic models (DDPMs)
Data Processing
Data Sources
Deduplication
Sanitation
Video Processing Pipeline
3D Pose Tracker
Avatar Navigation
Preliminaries
Body Model.
Camera.
SABR-CLIMB.
...and 31 more sections

Figures (15)

Figure 1: Indoor, blue route
Figure 2: Outdoor route
Figure 3: Overview of our data pipeline. Left: The data processing pipeline used to curate and prepare the NAV-22M dataset for training the SABR-CLIMB model. Right: Enhanced view of the video processing pipeline.
Figure 4: Overview of the SABR-Climb-600M model.
Figure 5: The modified Diffusion Transformer (DiT) architecture. Embedded video frames are taken in as conditioning via cross attention while the input motion latent vectors are fed through spatio-temporal DiT blocks.
...and 10 more figures
