MoMaps: Semantics-Aware Scene Motion Generation with Motion Maps

Jiahui Lei, Kyle Genova, George Kopanas, Noah Snavely, Leonidas Guibas

TL;DR

MoMaps provide a pixel-aligned, dense 3D motion representation that disentangles camera motion from scene motion and makes it possible to leverage large pre-trained image diffusion models for long-range 3D motion generation. The authors build a large MoMap database from over 50,000 real videos and train a diffusion model to forecast per-pixel 3D trajectories conditioned on the semantics of the scene view and on language prompts. They also describe a practical 2D video synthesis pipeline based on rendering and completion. A vision-language conditioned control framework using a domain-specific language further enhances semantic controllability. Experiments show plausible, semantically coherent 3D scene motion and improved video synthesis quality, supporting the value of explicit 3D motion priors for AR, robotics, and related tasks, while highlighting avenues for future work on multi-MoMap generation and finer motion control.

Abstract

This paper addresses the challenge of learning semantically and functionally meaningful 3D motion priors from real-world videos, in order to enable prediction of future 3D scene motion from a single input image. We propose a novel pixel-aligned Motion Map (MoMap) representation for 3D scene motion, which can be generated from existing generative image models to facilitate efficient and effective motion prediction. To learn meaningful distributions over motion, we create a large-scale database of MoMaps from over 50,000 real videos and train a diffusion model on these representations. Our motion generation not only synthesizes trajectories in 3D but also suggests a new pipeline for 2D video synthesis: first generate a MoMap, then warp an image accordingly and complete the warped point-based renderings. Experimental results demonstrate that our approach generates plausible and semantically consistent 3D scene motion.
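To make the "generate a MoMap, then warp, then complete" idea concrete, here is a minimal NumPy sketch of the warp step only. Everything in it is an illustrative assumption and not the authors' implementation: the `(H, W, T, 3)` array layout, the pinhole camera with focal length `focal`, the constant-depth toy scene, and the helper names `make_toy_momap` and `render_frame` are all hypothetical. No diffusion model is involved; the toy MoMap is hand-built.

```python
import numpy as np


def make_toy_momap(height, width, num_steps, depth=2.0, focal=50.0):
    """Build a toy MoMap of shape (H, W, T, 3): one 3D trajectory per pixel.

    Assumed layout (not taken from the paper): momap[v, u, t] is the 3D
    position at time t of the point seen at pixel (u, v) in the first frame.
    """
    cx, cy = (width - 1) / 2.0, (height - 1) / 2.0
    u, v = np.meshgrid(np.arange(width), np.arange(height))
    # Back-project every first-frame pixel to a 3D point at constant depth.
    x0 = (u - cx) / focal * depth
    y0 = (v - cy) / focal * depth
    z0 = np.full_like(x0, depth)
    start = np.stack([x0, y0, z0], axis=-1)                      # (H, W, 3)
    momap = np.repeat(start[:, :, None, :], num_steps, axis=2)   # (H, W, T, 3)
    # Toy motion: the left half of the image translates along +x over time.
    shift = np.linspace(0.0, 0.4, num_steps)
    momap[:, : width // 2, :, 0] += shift
    return momap


def render_frame(image, momap, t, focal=50.0):
    """Splat first-frame colors to their projected positions at time t.

    Returns the warped image and a boolean mask of covered pixels. This is a
    plain point splat with no depth ordering, so overlapping points simply
    overwrite each other.
    """
    height, width = image.shape[:2]
    cx, cy = (width - 1) / 2.0, (height - 1) / 2.0
    points = momap[:, :, t, :].reshape(-1, 3)
    colors = image.reshape(-1, image.shape[-1])
    z = points[:, 2]
    u = np.rint(points[:, 0] / z * focal + cx).astype(int)
    v = np.rint(points[:, 1] / z * focal + cy).astype(int)
    visible = (z > 0) & (u >= 0) & (u < width) & (v >= 0) & (v < height)
    warped = np.zeros_like(image)
    mask = np.zeros((height, width), dtype=bool)
    warped[v[visible], u[visible]] = colors[visible]
    mask[v[visible], u[visible]] = True
    return warped, mask


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    image = rng.integers(0, 256, size=(16, 32, 3), dtype=np.uint8)
    momap = make_toy_momap(16, 32, num_steps=5)
    warped, mask = render_frame(image, momap, t=4)
    print("uncovered pixels:", int((~mask).sum()))
```

Pixels where `mask` is `False` are the holes that open up as points move away. In the pipeline described above, those are the regions a completion model would fill in; that step is not sketched here.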

Paper Structure

This paper contains 23 sections, 6 figures, 2 tables.

Figures (6)

  • Figure 1: (A) Given the first time frame color and segmentation image and a text prompt, our model generates the future dynamic 3D scene. (B) 3D dynamic scenes as pixel-aligned curve/trajectory images, re-purposing an image diffusion model for 4D generation.
  • Figure 2: Motion Maps: (A) Dynamic 3D scenes can be represented as one or more Motion Maps -- curve/trajectory images. (B) We develop a full-stack data pipeline to recover a large dataset of MoMaps from many real videos.
  • Figure 3: Method Overview: (A) A MoMap can be compressed into a compact latent by initializing from and fine-tuning a Stable Diffusion VAE. (B) Given a starting frame and a language condition, a MoMap can be generated by fine-tuning the SD UNet.
  • Figure 4: Application: (A) 2D video generation by rendering a MoMap and then completing the result; (B-1) motion DSL representation; (B-2) inferring the motion DSL with a VLM for finer generation control.
  • Figure 5: Qualitative Results: (A) comparison with baselines. (B) More generation results. (C) Diverse generations from the same condition with different random seeds.
  • ...and 1 more figure