SIRE: SE(3) Intrinsic Rigidity Embeddings
Cameron Smith, Basile Van Hoorick, Vitor Guizilini, Yue Wang
TL;DR
Dynamic scenes with multiple independently moving rigid bodies are challenging for traditional SfM and 2D segmentation. SIRE introduces SE(3) Intrinsic Rigidity Embeddings: an image encoder predicts depth and an $M$-dimensional rigidity embedding for each frame, and is trained with a 4D reconstruction loss that lifts 2D point tracks into SE(3) trajectories and supervises them by reprojection against the original 2D tracks. The method models a global SE(3) motion for static parts and local per-track SE(3) motions for dynamic components, with rigidity masks derived from the embeddings softly segmenting motion groups. It is data-efficient and applies to downstream segmentation, self-supervised depth estimation, and SE(3) trajectory estimation, using either per-video optimization or dataset-scale pretraining. Because the framework is differentiable, it can learn priors over geometry and object rigidity from real-world video, which is relevant to 3D scene understanding in vision and robotics.
Abstract
Motion serves as a powerful cue for scene perception and understanding by separating independently moving surfaces and organizing the physical world into distinct entities. We introduce SIRE, a self-supervised method for motion discovery of objects and dynamic scene reconstruction from casual scenes by learning intrinsic rigidity embeddings from videos. Our method trains an image encoder to estimate scene rigidity and geometry, supervised by a simple 4D reconstruction loss: a least-squares solver uses the estimated geometry and rigidity to lift 2D point track trajectories into SE(3) tracks, which are re-projected back to 2D and compared against the original 2D trajectories for supervision. Crucially, our framework is fully end-to-end differentiable and can be optimized either on video datasets to learn generalizable image priors, or on a single video to capture scene-specific structure, which highlights its data efficiency. We demonstrate the effectiveness of our rigidity embeddings and geometry across multiple settings, including downstream object segmentation, SE(3) rigid motion estimation, and self-supervised depth estimation. Our findings suggest that SIRE can learn strong geometry and motion rigidity priors from video data with minimal supervision.
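To make the lift-and-reproject loss concrete, the following is a minimal NumPy sketch for a single pair of frames. It is an illustration under stated assumptions, not the authors' implementation: it assumes a known pinhole intrinsics matrix K, uses a weighted Kabsch (Procrustes) fit as a stand-in for the least-squares solver, and treats the per-track weights w as given rigidity weights. All function names are hypothetical. A trainable version would be written in an autodiff framework so that gradients reach the depth and rigidity predictions.

```python
import numpy as np

def unproject(uv, depth, K):
    """Lift pixels (N, 2) with per-point depth (N,) to camera-frame 3D points (N, 3)."""
    homog = np.hstack([uv, np.ones((uv.shape[0], 1))])
    rays = homog @ np.linalg.inv(K).T
    return rays * depth[:, None]

def project(X, K):
    """Project camera-frame 3D points (N, 3) to pixels (N, 2)."""
    x = X @ K.T
    return x[:, :2] / x[:, 2:3]

def weighted_se3(P, Q, w):
    """Weighted Kabsch fit: R, t minimizing sum_i w_i * ||R P_i + t - Q_i||^2."""
    w = w / w.sum()
    mu_p = (w[:, None] * P).sum(axis=0)
    mu_q = (w[:, None] * Q).sum(axis=0)
    H = (w[:, None] * (P - mu_p)).T @ (Q - mu_q)   # 3x3 weighted cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))          # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = mu_q - R @ mu_p
    return R, t

def reprojection_loss(uv0, uv1, depth0, depth1, K, w):
    """Lift both frames' tracks to 3D, fit one SE(3) motion, reproject, compare in 2D."""
    P = unproject(uv0, depth0, K)
    Q = unproject(uv1, depth1, K)
    R, t = weighted_se3(P, Q, w)
    uv1_hat = project(P @ R.T + t, K)
    return float(np.mean(np.linalg.norm(uv1_hat - uv1, axis=1)))

# Synthetic check: one rigid body observed in two frames (all values are made up).
rng = np.random.default_rng(0)
K = np.array([[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]])
P_true = rng.uniform([-1.0, -1.0, 4.0], [1.0, 1.0, 6.0], size=(50, 3))
a = 0.1  # rotation about the y axis, in radians
R_true = np.array([[np.cos(a), 0.0, np.sin(a)],
                   [0.0, 1.0, 0.0],
                   [-np.sin(a), 0.0, np.cos(a)]])
t_true = np.array([0.2, -0.1, 0.3])
Q_true = P_true @ R_true.T + t_true

uv0, uv1 = project(P_true, K), project(Q_true, K)
w = np.ones(50)
R_est, t_est = weighted_se3(P_true, Q_true, w)
loss = reprojection_loss(uv0, uv1, P_true[:, 2], Q_true[:, 2], K, w)
```

With exact depths and a single rigid motion, the fitted transform matches the true one and the reprojection error is numerically zero. Errors in depth or in the rigidity weights raise the loss, which is the signal the abstract describes as supervision.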
