SIRE: SE(3) Intrinsic Rigidity Embeddings

Cameron Smith, Basile Van Hoorick, Vitor Guizilini, Yue Wang

TL;DR

Dynamic scenes with multiple independently moving rigid bodies pose challenges for traditional SfM and 2D segmentation. SIRE introduces SE(3) Intrinsic Rigidity Embeddings: it predicts depth and $M$-dimensional rigidity embeddings from each frame and trains with a 4D reconstruction loss that lifts 2D point tracks into SE(3) trajectories, reprojects them to 2D, and compares them against the original 2D tracks. The method supports a global SE(3) motion for static parts and local per-track SE(3) motions for dynamic components, with rigidity masks derived from the embeddings to softly segment motion groups. It demonstrates strong data efficiency and versatility across downstream segmentation, self-supervised depth estimation, and SE(3) trajectory estimation, using either per-video optimization or dataset-scale pretraining. This differentiable framework enables learning robust priors over geometry and object rigidity from real-world video, with implications for 3D scene understanding in vision, robotics, and related fields.

Abstract

Motion serves as a powerful cue for scene perception and understanding by separating independently moving surfaces and organizing the physical world into distinct entities. We introduce SIRE, a self-supervised method for motion discovery of objects and dynamic scene reconstruction from casual scenes by learning intrinsic rigidity embeddings from videos. Our method trains an image encoder to estimate scene rigidity and geometry, supervised by a simple 4D reconstruction loss: a least-squares solver uses the estimated geometry and rigidity to lift 2D point track trajectories into SE(3) tracks, which are simply re-projected back to 2D and compared against the original 2D trajectories for supervision. Crucially, our framework is fully end-to-end differentiable and can be optimized either on video datasets to learn generalizable image priors, or even on a single video to capture scene-specific structure, highlighting strong data efficiency. We demonstrate the effectiveness of our rigidity embeddings and geometry across multiple settings, including downstream object segmentation, SE(3) rigid motion estimation, and self-supervised depth estimation. Our findings suggest that SIRE can learn strong geometry and motion rigidity priors from video data, with minimal supervision.
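
The abstract does not spell out the form of the least-squares solver. As a rough illustration only, the sketch below assumes it resembles a weighted Kabsch (Procrustes) fit: given 3D points `p` in one frame, their correspondences `q` in another, and per-point confidence weights `w`, it returns the rotation and translation minimizing the weighted squared residual. The function name `weighted_rigid_fit` and the NumPy formulation are assumptions made for this example, not the authors' implementation.

```python
import numpy as np

def weighted_rigid_fit(p, q, w):
    """Return (R, t) minimizing sum_i w_i * ||R @ p_i + t - q_i||^2.

    p, q: (N, 3) corresponding 3D points; w: (N,) non-negative weights.
    Hypothetical helper: a standard weighted Kabsch solve, assumed here
    as a stand-in for the solver described in the abstract.
    """
    w = w / w.sum()                          # normalize weights
    p_mean = (w[:, None] * p).sum(axis=0)    # weighted centroids
    q_mean = (w[:, None] * q).sum(axis=0)
    p_c = p - p_mean
    q_c = q - q_mean
    H = (w[:, None] * p_c).T @ q_c           # 3x3 weighted cross-covariance
    U, _, Vt = np.linalg.svd(H)
    # Flip the last axis if needed so that R is a proper rotation (det = +1).
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = q_mean - R @ p_mean
    return R, t
```

Because every step is built from sums, matrix products, and an SVD, the same computation can be written in an autodiff framework so that gradients flow back to the weights and the 3D points, which is the property an end-to-end differentiable pipeline would rely on.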

Paper Structure

This paper contains 18 sections, 3 equations, 7 figures, 2 tables.

Figures (7)

  • Figure 1: SIRE is an end-to-end differentiable method for learning the underlying 3D rigid scene structure from 2D videos. Intrinsic rigid embeddings softly encode scene rigidity -- two points belong to the same rigid body if they have similar features. Our method is supervised just via 2D point tracks and can be trained on a dataset of videos to learn generalizable network priors or even on a single video.
  • Figure 2: A SIRE training forward pass. Given two frames and 2D point tracks, a per-frame CNN first estimates depth and rigidity embeddings. We lift the 2D point tracks into 3D scene flow via the depth estimates. For each point track, we extract a rigidity map by comparing its rigidity embedding to all other point track rigidity embeddings, then solve for the SE(3) transformation on the global scene flow using the rigidity map as confidence weights in the solver. Lastly, the per-track SE(3) trajectories are reprojected back to 2D and compared with the original 2D trajectory for supervision.
  • Figure 3: Results on the CO3D-Dogs Dataset. Here we plot estimated rigidity embeddings, depth estimates, accumulated 4D point clouds, color-coded SE(3) rotation and translation components, and highlighted rigidity maps. Observe how SE(3) components within rigid bodies are often constant. Rigidity maps are estimated per-track and we manually pick a few representative ones here; consider how they yield semantically meaningful soft segmentations. Note that we additionally perform a short per-scene fine-tuning step on top of the generalizable estimates to improve the results.
  • Figure 4: Rigidity Response Grid. Here we plot, for three scenes, the rigidity response grids, where each cell contains a rigidity map from the point track at that location to all other point tracks. We plot only a 16x16 grid here due to space constraints; in practice we use 64x64 grids of point tracks. We also include (top) the per-track images of RGB, SE(3) rotation (Euler angles) $\phi$ and translation vector $\tau$, and rigidity embeddings, and (bottom) manually highlighted rigidity maps. The center example (bear) clearly shows that the largest component (blue) corresponds to the background (camera movement), while point tracks on the bear's leg (red) form another distinct group (right leg movement). Note that these are results of per-scene optimizations without depth supervision.
  • Figure 5: Downstream Segmentation Plots. We demonstrate that our method's embeddings are useful for downstream moving object segmentation by freezing features from our model and baselines and training a two-layer MLP for segmentation. We compare using features from our rigidity embeddings, the last layer of our trained feature backbone, and the feature backbone before our training. Note these embeddings are trained from single-scene optimizations on each of these videos and without depth supervision. For each method, we visualize the PCA of their features (top) and their segmentations (bottom).
  • ...and 2 more figures
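
The lifting, rigidity-map, and reprojection steps named in the Figure 2 caption can be sketched as follows. This is a minimal illustration under stated assumptions: a pinhole camera with known intrinsics `K`, and a Gaussian kernel on embedding distance as the similarity measure. The captions do not specify either choice, and the names `unproject`, `project`, and `rigidity_map` are hypothetical.

```python
import numpy as np

def unproject(uv, depth, K):
    """Lift pixels (N, 2) with per-point depth (N,) to camera-space points (N, 3).

    Assumes a pinhole camera with intrinsics K (3x3); not stated in the source.
    """
    ones = np.ones((uv.shape[0], 1))
    rays = np.hstack([uv, ones]) @ np.linalg.inv(K).T  # K^-1 [u, v, 1]^T per row
    return rays * depth[:, None]

def project(xyz, K):
    """Pinhole projection of camera-space points (N, 3) back to pixels (N, 2)."""
    uvw = xyz @ K.T
    return uvw[:, :2] / uvw[:, 2:3]

def rigidity_map(emb, i, temperature=1.0):
    """Soft weights between track i and every track, from embeddings (N, M).

    Assumed similarity: Gaussian kernel on squared embedding distance, so
    tracks with similar embeddings get weights near 1.
    """
    sq_dist = ((emb - emb[i]) ** 2).sum(axis=1)
    return np.exp(-sq_dist / temperature)

# Example: lift tracks, then reproject them unchanged (identity motion).
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
uv = np.array([[100.0, 120.0], [330.0, 250.0], [600.0, 400.0]])
depth = np.array([2.0, 3.5, 1.2])
emb = np.array([[0.0, 1.0], [0.1, 0.9], [2.0, -1.0]])

xyz = unproject(uv, depth, K)
reprojection_error = np.abs(project(xyz, K) - uv).max()
weights_for_track_0 = rigidity_map(emb, 0)
```

In a full forward pass, the weights for each track would serve as the confidences in a weighted SE(3) solve on the lifted points, and the transformed points would be projected with `project` and compared against the tracked 2D positions.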