RoMo: Robust Motion Segmentation Improves Structure from Motion

Lily Goli, Sara Sabour, Mark Matthews, Marcus Brubaker, Dmitry Lagun, Alec Jacobson, David J. Fleet, Saurabh Saxena, Andrea Tagliasacchi

TL;DR

RoMo tackles the challenge of motion segmentation in dynamic videos to bolster structure-from-motion (SfM) by introducing a zero-shot, iterative framework that fuses optical-flow cues, epipolar geometry, and a foundation-model segmentation (SAMv2) to produce dense, temporally consistent masks. The method alternates between weak epipolar supervision (robust fundamental matrix estimation and flow scoring) and a lightweight MLP classifier trained on high-level features, with two refinement iterations and a final high-resolution mask via SAMv2. RoMo achieves state-of-the-art performance on motion-segmentation benchmarks without supervision and delivers substantial improvements in camera calibration for dynamic scenes, including real-world data, through integration with COLMAP and related SfM tools. Additionally, the authors release the Casual Motion dataset to benchmark in-the-wild SfM with ground-truth camera trajectories, highlighting RoMo’s practical impact for robust 3D reconstruction in dynamic environments.
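As a rough illustration of the "flow scoring" step, the sketch below measures how well each optical-flow correspondence agrees with a given fundamental matrix and turns the scores into sparse weak labels. It assumes the Sampson distance as the epipolar error, which is a standard choice but is not confirmed by this summary to be the one RoMo uses. The function names and both thresholds are hypothetical, and the fundamental matrix is supplied by hand rather than robustly estimated.

```python
import numpy as np

def sampson_error(F, pts1, pts2):
    """Sampson (first-order geometric) epipolar error for matches pts1 -> pts2.

    F is a 3x3 fundamental matrix; pts1 and pts2 are (N, 2) pixel coordinates,
    where pts2 = pts1 + optical flow. Assumed error measure, not the paper's.
    """
    ones = np.ones((pts1.shape[0], 1))
    x1 = np.hstack([pts1, ones])          # homogeneous points in frame t
    x2 = np.hstack([pts2, ones])          # homogeneous points in frame t+1
    Fx1 = x1 @ F.T                        # row i holds F @ x1[i]
    Ftx2 = x2 @ F                         # row i holds F.T @ x2[i]
    num = np.sum(x2 * Fx1, axis=1) ** 2   # (x2^T F x1)^2
    den = Fx1[:, 0] ** 2 + Fx1[:, 1] ** 2 + Ftx2[:, 0] ** 2 + Ftx2[:, 1] ** 2
    return num / np.maximum(den, 1e-12)

def weak_labels(err, static_thresh, dynamic_thresh):
    """Hypothetical thresholding: 0 = likely static, 1 = likely dynamic, -1 = unlabeled."""
    labels = np.full(err.shape, -1)
    labels[err <= static_thresh] = 0
    labels[err >= dynamic_thresh] = 1
    return labels

# Toy setup: identity intrinsics and a camera translating along x, so F = [t]_x
# with t = (1, 0, 0). Static points may then only move horizontally.
F = np.array([[0.0, 0.0, 0.0],
              [0.0, 0.0, -1.0],
              [0.0, 1.0, 0.0]])
pts1 = np.array([[10.0, 20.0], [30.0, 40.0]])
flow = np.array([[3.0, 0.0],   # horizontal flow: consistent with the camera motion
                 [0.0, 4.0]])  # vertical flow: violates the epipolar constraint
err = sampson_error(F, pts1, pts1 + flow)
labels = weak_labels(err, static_thresh=0.5, dynamic_thresh=2.0)
```

In this toy case the first match has zero error and is labeled static, while the second has error 8 and is labeled dynamic. Matches whose error falls between the two thresholds would stay unlabeled, which is why the resulting supervision is sparse.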

Abstract

There has been extensive progress in the reconstruction and generation of 4D scenes from monocular casually-captured video. While these tasks rely heavily on known camera poses, the problem of finding such poses using structure-from-motion (SfM) often depends on robustly separating static from dynamic parts of a video. The lack of a robust solution to this problem limits the performance of SfM camera-calibration pipelines. We propose a novel approach to video-based motion segmentation to identify the components of a scene that are moving w.r.t. a fixed world frame. Our simple but effective iterative method, RoMo, combines optical flow and epipolar cues with a pre-trained video segmentation model. It outperforms unsupervised baselines for motion segmentation as well as supervised baselines trained from synthetic data. More importantly, the combination of an off-the-shelf SfM pipeline with our segmentation masks establishes a new state-of-the-art on camera calibration for scenes with dynamic content, outperforming existing methods by a substantial margin.

Paper Structure

This paper contains 37 sections, 6 equations, 16 figures, 1 table.

Figures (16)

  • Figure 1: We introduce a zero-shot motion segmentation method for video based on cues from epipolar geometry (top right) and optical flow. Our predicted masks (bottom left) can help improve SfM camera calibration on highly dynamic scenes (bottom right).
  • Figure 2: Epipolar matches (\ref{sec:epipolar}) -- $\mathbf{U}_{t}$ and $\mathbf{L}_{t}$ respectively capture the most likely dynamic and static parts of the scene.
  • Figure 3: Feature-based classifier (\ref{sec:classifier}) -- The feature space of foundation models shows a strong objectness prior, as seen in the first three PCA components of the features. We leverage these features to train our classifier on sparse and noisy labels from epipolar supervision, generating coherent motion masks.
  • Figure 4: Iterative refinement (\ref{sec:iterative}) -- Repeated fundamental matrix estimation and motion prediction improve the estimated camera pose and masks, often converging after 2 iterations.
  • Figure 5: Final refinement (\ref{sec:spatial_ref}) -- With SAMv2 we improve the fine-grained details in the mask. In particular, note the finer details around the fingers and the dress frills.
  • ...and 11 more figures
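
The feature-based classifier described in the Figure 3 caption can be sketched as a small MLP fitted only on the pixels that received a weak label, then applied to every pixel to obtain a dense mask. This is a minimal NumPy sketch under stated assumptions: the layer width, learning rate, step count, label convention (1 dynamic, 0 static, -1 unlabeled), and function names are all hypothetical, and random vectors would stand in for real foundation-model features. It is not the authors' architecture or training recipe.

```python
import numpy as np

def _sigmoid(z):
    # Clip logits so np.exp never overflows.
    return 1.0 / (1.0 + np.exp(-np.clip(z, -30.0, 30.0)))

def train_mlp(feats, labels, hidden=16, steps=300, lr=0.2, seed=0):
    """Fit a one-hidden-layer MLP with full-batch gradient descent.

    feats: (N, D) per-pixel features; labels: (N,) with 1 = dynamic,
    0 = static, -1 = unlabeled. Only labeled rows contribute to the loss.
    """
    rng = np.random.default_rng(seed)
    keep = labels >= 0
    X, y = feats[keep], labels[keep].astype(float)
    W1 = rng.normal(0.0, 0.1, (X.shape[1], hidden))
    b1 = np.zeros(hidden)
    W2 = rng.normal(0.0, 0.1, hidden)
    b2 = 0.0
    for _ in range(steps):
        H = np.tanh(X @ W1 + b1)              # hidden activations
        p = _sigmoid(H @ W2 + b2)             # predicted P(dynamic)
        g = (p - y) / len(y)                  # gradient of mean BCE w.r.t. logits
        gH = np.outer(g, W2) * (1.0 - H ** 2) # backprop through tanh
        W2 -= lr * (H.T @ g)
        b2 -= lr * g.sum()
        W1 -= lr * (X.T @ gH)
        b1 -= lr * gH.sum(axis=0)
    return W1, b1, W2, b2

def predict_dynamic(params, feats):
    """Return P(dynamic) for every row of feats, labeled or not."""
    W1, b1, W2, b2 = params
    return _sigmoid(np.tanh(feats @ W1 + b1) @ W2 + b2)
```

Thresholding `predict_dynamic(params, feats)` at 0.5 gives a dense binary mask. Because features of the same object tend to cluster, a low-capacity classifier like this one has little room to fit isolated mislabeled pixels, which is the intuition behind training on sparse, noisy epipolar labels.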