Stereo4D: Learning How Things Move in 3D from Internet Stereo Videos

Linyi Jin, Richard Tucker, Zhengqi Li, David Fouhey, Noah Snavely, Aleksander Holynski

TL;DR

The paper tackles the scarcity of ground-truth dynamic 3D data by mining 4D reconstructions from internet VR180 stereo videos to produce pseudo-metric point clouds with long-term trajectories. It introduces Stereo4D, a data-processing pipeline that fuses camera poses, stereo depth, and 2D tracks, and then uses this data to train Dyna-DUSt3R, a dynamic extension of DUSt3R that predicts 3D structure and motion between frames. Empirically, models trained on Stereo4D demonstrate superior generalization to real-world dynamic scenes and yield more accurate 3D motion and structure than synthetic baselines. The work provides a scalable route to learn dynamic 3D priors from diverse real-world content, with broad implications for robotics, scene understanding, and 3D reconstruction.

Abstract

Learning to understand dynamic 3D scenes from imagery is crucial for applications ranging from robotics to scene reconstruction. Yet, unlike other problems where large-scale supervised training has enabled rapid progress, directly supervising methods for recovering 3D motion remains challenging due to the fundamental difficulty of obtaining ground truth annotations. We present a system for mining high-quality 4D reconstructions from internet stereoscopic, wide-angle videos. Our system fuses and filters the outputs of camera pose estimation, stereo depth estimation, and temporal tracking methods into high-quality dynamic 3D reconstructions. We use this method to generate large-scale data in the form of world-consistent, pseudo-metric 3D point clouds with long-term motion trajectories. We demonstrate the utility of this data by training a variant of DUSt3R to predict structure and 3D motion from real-world image pairs, showing that training on our reconstructed data enables generalization to diverse real-world scenes. Project page and data at: https://stereo4d.github.io

Paper Structure

This paper contains 22 sections, 10 equations, 17 figures, 2 tables.

Figures (17)

  • Figure 1: There is currently no scalable source of data for real-world, ground truth 3D motion paired with video. We present a framework for mining such data from existing stereoscopic videos on the Internet, in the form of 3D point clouds with long-range world-space trajectories. Our framework fuses and filters camera poses, dense depth maps, and 2D motion trajectories to produce high-quality, pseudo-metric point clouds with long-term 3D motion trajectories, pictured above, for hundreds of thousands of video clips. We show how this data is useful in learning a model that reasons about both 3D shape and motion in imagery.
  • Figure 2: Data processing pipeline. Our method starts with VR180 (wide-angle, stereoscopic) videos, and estimates metric stereo depth, 2D point tracks, and camera poses. These quantities allow the tracks to be lifted to 3D where they are filtered and denoised to produce world-space, metric 3D point trajectories.
  • Figure 3: Effect of track optimization. Comparing motion trajectories before and after track optimization, we see that optimization resolves the high-frequency jitter along camera rays, affecting both static and dynamic content. After optimization, static content has static tracks, and dynamic tracks are less noisy.
  • Figure 4: Diverse motion: Stereo4D captures a wide variety of types of moving objects: swimming fish, walking pedestrians, moving vehicles, a farmer sowing seeds, etc. It includes source videos captured with both stationary (left) and moving (right) cameras.
  • Figure 5: Diverse scene content: A word cloud of captioned frames from our dataset shows our data is diverse, including a variety of common objects seen in videos.
  • ...and 12 more figures
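
The Figure 2 caption describes lifting 2D point tracks to 3D using metric stereo depth and camera poses. The following is a minimal sketch of that geometric step, not the authors' implementation. It assumes rectified pinhole frames (VR180 footage is wide-angle, so a real pipeline would need a suitable projection model), depth measured along the camera z-axis, and camera-to-world poses given as 4x4 matrices. The function names `disparity_to_depth` and `lift_tracks_to_world`, and all parameter names, are hypothetical.

```python
import numpy as np


def disparity_to_depth(disparity, focal_px, baseline_m):
    """Convert stereo disparity (pixels) to metric depth (meters).

    Assumes a rectified pinhole stereo pair: depth = focal * baseline / disparity.
    Non-positive disparities are marked invalid with NaN.
    """
    disparity = np.asarray(disparity, dtype=float)
    depth = np.full_like(disparity, np.nan)
    valid = disparity > 0
    depth[valid] = focal_px * baseline_m / disparity[valid]
    return depth


def lift_tracks_to_world(tracks_px, depths, K, cam_to_world):
    """Lift 2D tracks to world-space 3D trajectories.

    tracks_px:    (T, N, 2) pixel coordinates of N tracks over T frames
    depths:       (T, N) metric depth along the camera z-axis at each track point
    K:            (3, 3) pinhole intrinsics (assumed shared by all frames)
    cam_to_world: (T, 4, 4) camera-to-world pose for each frame
    returns:      (T, N, 3) world-space points
    """
    T, N, _ = tracks_px.shape

    # Homogeneous pixel coordinates [u, v, 1].
    pix_h = np.concatenate([tracks_px, np.ones((T, N, 1))], axis=-1)

    # Back-project through the intrinsics: rays with unit z in the camera frame.
    rays = pix_h @ np.linalg.inv(K).T

    # Scale each ray by its depth to get camera-space points.
    pts_cam = rays * depths[..., None]

    # Apply each frame's rotation and translation to reach world space.
    R = cam_to_world[:, :3, :3]
    t = cam_to_world[:, :3, 3]
    return np.einsum("tij,tnj->tni", R, pts_cam) + t[:, None, :]
```

With accurate depth and poses, a static scene point lifted this way should map to the same world coordinate in every frame. Depth noise along camera rays breaks that property, which is the kind of jitter the Figure 3 caption says track optimization resolves.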