Table of Contents
Fetching ...

Agent-to-Sim: Learning Interactive Behavior Models from Casual Longitudinal Videos

Gengshan Yang, Andrea Bajcsy, Shunsuke Saito, Angjoo Kanazawa

TL;DR

The Agent-to-Sim (ATS), a framework for learning interactive behavior models of 3D agents from casual longitudinal video collections, is presented and results on pets and human given monocular RGBD videos captured by a smartphone are demonstrated.

Abstract

We present Agent-to-Sim (ATS), a framework for learning interactive behavior models of 3D agents from casual longitudinal video collections. Different from prior works that rely on marker-based tracking and multiview cameras, ATS learns natural behaviors of animal and human agents non-invasively through video observations recorded over a long time-span (e.g., a month) in a single environment. Modeling 3D behavior of an agent requires persistent 3D tracking (e.g., knowing which point corresponds to which) over a long time period. To obtain such data, we develop a coarse-to-fine registration method that tracks the agent and the camera over time through a canonical 3D space, resulting in a complete and persistent spacetime 4D representation. We then train a generative model of agent behaviors using paired data of perception and motion of an agent queried from the 4D reconstruction. ATS enables real-to-sim transfer from video recordings of an agent to an interactive behavior simulator. We demonstrate results on pets (e.g., cat, dog, bunny) and human given monocular RGBD videos captured by a smartphone.

Agent-to-Sim: Learning Interactive Behavior Models from Casual Longitudinal Videos

TL;DR

The Agent-to-Sim (ATS), a framework for learning interactive behavior models of 3D agents from casual longitudinal video collections, is presented and results on pets and human given monocular RGBD videos captured by a smartphone are demonstrated.

Abstract

We present Agent-to-Sim (ATS), a framework for learning interactive behavior models of 3D agents from casual longitudinal video collections. Different from prior works that rely on marker-based tracking and multiview cameras, ATS learns natural behaviors of animal and human agents non-invasively through video observations recorded over a long time-span (e.g., a month) in a single environment. Modeling 3D behavior of an agent requires persistent 3D tracking (e.g., knowing which point corresponds to which) over a long time period. To obtain such data, we develop a coarse-to-fine registration method that tracks the agent and the camera over time through a canonical 3D space, resulting in a complete and persistent spacetime 4D representation. We then train a generative model of agent behaviors using paired data of perception and motion of an agent queried from the 4D reconstruction. ATS enables real-to-sim transfer from video recordings of an agent to an interactive behavior simulator. We demonstrate results on pets (e.g., cat, dog, bunny) and human given monocular RGBD videos captured by a smartphone.

Paper Structure

This paper contains 16 sections, 14 equations, 15 figures, 7 tables.

Figures (15)

  • Figure 1: Learning agent behavior from longitudinal casual video recordings. We answer the following question: can we simulate the behavior of an agent, by learning from casually-captured videos of the same agent recorded across a long period of time (e.g., a month)? A) We first reconstruct videos in 4D (3D & time), which includes the scene, the trajectory of the agent, and the trajectory of the observer (i.e., camera held by the observer). Such individual 4D reconstructions are registered across time, resulting in a complete and persistent 4D representation. B) Then we learn a model of the agent for interactive behavior generation. The behavior model explicitly reasons about goals, paths, and full body movements conditioned on the agent's ego-perception and past trajectory. Such an agent representation allows generation of novel scenarios through conditioning. For example, conditioned on different observer trajectories, the cat agent chooses to walk to the carpet, stays still while quivering his tail, or hide under the tray stand. Please see videos results in the supplement.
  • Figure 2: Pipeline for behavior generation. We encode egocentric information into a perception code $\omega$, conditioned on which we generate fully body motion in a hierarchical fashion. We start by generating goals ${\bf Z}$, then paths ${\bf P}$ and finally body poses ${\bf G}$. Each node is represented by the gradient of its log distribution, trained with denoising objectives (Eq. \ref{['eq:diffusion']}). Given ${\bf G}$, the full body motion of an agent can be computed via blend skinning (Eq. \ref{['eq:lbs']}).
  • Figure 3: Comparison on multi-video scene reconstruction. We show birds-eye-view rendering of the reconstructed scene using the bunny dataset. Compared to TotalRecon that does not register multiple videos, ATS produces higher-quality scene reconstruction. Neural localizer (NL) and featuremetric losses (FBA) are shown important for camera registration. Scene annealing is important for reconstructing a complete scene from partial video captures.
  • Figure 4: Analysis of conditioning signals. We show results of removing one conditioning signal at a time. Removing observer conditioning and past trajectory conditioning makes the sampled goals more spread out (e.g., regions both in front of the agent and behind the agent); removing the environment conditioning introduces infeasible goals that penetrate the ground and the walls.
  • Figure 5: Results of 4D reconstruction. Top: reference images and renderings. Background color represents correspondence. Colored blobs on the cat represent $B=25$ bones (e.g., head is represented by the yellow blob). The magenta colored lines represents reconstructed trajectories of each blob in the world space. Bottom: Bird's eye view of the reconstructed scene and agent trajectories, registered to the same scene coordinate. Each colored line represents a unique video sequence where boxes and spheres indicate the starting and the end location.
  • ...and 10 more figures