GeoMotion: Rethinking Motion Segmentation via Latent 4D Geometry

Xiankang He; Peile Lin; Ying Cui; Dongyan Guo; Chunhua Shen; Xiaoqin Zhang

TL;DR

This work proposes a fully learning-based approach that directly infers moving objects from latent feature representations via attention mechanisms, thus enabling end-to-end feed-forward motion segmentation in dynamic scenes.

Abstract

Motion segmentation in dynamic scenes is highly challenging, as conventional methods rely heavily on estimating camera poses and point correspondences from inherently noisy motion cues. Existing statistical inference and iterative optimization techniques struggle to mitigate the errors that accumulate across multi-stage pipelines, which often results in limited performance or high computational cost. In contrast, we propose a fully learning-based approach that directly infers moving objects from latent feature representations via attention mechanisms, thus enabling end-to-end feed-forward motion segmentation. Our key insight is to bypass explicit correspondence estimation and instead let the model learn to implicitly disentangle object and camera motion. Supported by recent advances in 4D scene geometry reconstruction (e.g., $\pi^3$), the proposed method leverages reliable camera poses and rich spatial-temporal priors, which ensure stable training and robust inference for the model. Extensive experiments demonstrate that by eliminating complex pre-processing and iterative refinement, our approach achieves state-of-the-art motion segmentation performance with high efficiency. The code is available at: https://github.com/zjutcvg/GeoMotion.

Paper Structure (23 sections, 2 equations, 8 figures, 5 tables, 1 algorithm)

Introduction
Related Work
Conventional Motion Segmentation
Feed-forward 4D Reconstruction
Method
Feed-forward Architecture
Visual Geometry Backbone
Training Loss
Experiment
Experimental Settings
Comparison with Motion Segmentation Methods
Quantitative Comparison
Qualitative Comparison
Comparison with Reconstruction Methods
Ablation Study
...and 8 more sections

Figures (8)

Figure 1: Overview of GeoMotion. Given an input video, our framework integrates 4D geometric priors from a pretrained reconstruction model ($\pi^3$) and local pixel-level motion from optical flow to infer dynamic object masks. By leveraging 4D geometric priors, the proposed GeoMotion disentangles object motion from camera motion in a single feed-forward pass.
Figure 2: Architecture of the proposed GeoMotion framework. The model comprises a feature aggregation module and a motion decoder. The former fuses latent 4D features, optical flow features, and camera pose embeddings, while the latter employs multi-head self-attention to decode motion masks. The design enables end-to-end feed-forward motion segmentation without iterative refinement.
Figure 3: Visualization of $\pi^3$ features across alternating attention layers. Shallow layers preserve semantic object-level features, whereas deeper layers encode high-level global geometry. Their fusion yields robust latent 4D representations that support accurate motion segmentation.
Figure 4: Qualitative comparison on multiple benchmarks. Visual comparison with state-of-the-art methods including OCLR-Flow, SegAnyMotion, and RoMo. The proposed method produces geometrically complete and visually coherent motion masks, preserving fine object details and boundaries in complex scenes.
Figure 5: Initialization comparison for the motion decoder. Initializing with $\pi^3$ pretrained parameters yields faster convergence and higher IoU than random initialization, demonstrating the benefit of large-scale geometry pretraining.
...and 3 more figures
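
The data flow described in the Figure 2 caption (fuse latent 4D features, optical flow features, and camera pose embeddings, then decode masks with multi-head self-attention) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names (`fuse_features`, `decode_motion_mask`, `init_params`, `mask_iou`), the fusion by summing linear projections, the single attention layer, all dimensions, and the random weights are assumptions chosen only to make the idea concrete. The `mask_iou` helper shows the standard intersection-over-union measure that Figure 5 reports.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def init_params(rng, dg, df, dp, d):
    """Hypothetical random weights; a real model would learn these."""
    names_shapes = {
        "w_geo": (dg, d), "w_flow": (df, d), "w_pose": (dp, d),
        "w_q": (d, d), "w_k": (d, d), "w_v": (d, d), "w_o": (d, d),
        "w_head": (d, 1),
    }
    return {k: rng.normal(scale=0.1, size=s) for k, s in names_shapes.items()}

def fuse_features(geo, flow, pose, p):
    """Assumed fusion: project each input to a shared width and sum.

    geo:  (T, N, Dg) latent 4D features per frame and patch token
    flow: (T, N, Df) optical flow features per frame and patch token
    pose: (T, Dp)    one camera pose embedding per frame
    """
    pose_tokens = (pose @ p["w_pose"])[:, None, :]  # broadcast over the N tokens
    return geo @ p["w_geo"] + flow @ p["w_flow"] + pose_tokens

def multi_head_self_attention(x, p, num_heads):
    """Standard scaled dot-product self-attention over a (L, D) token sequence."""
    L, D = x.shape
    dh = D // num_heads

    def split(t):  # (L, D) -> (H, L, dh)
        return t.reshape(L, num_heads, dh).transpose(1, 0, 2)

    q, k, v = split(x @ p["w_q"]), split(x @ p["w_k"]), split(x @ p["w_v"])
    attn = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(dh), axis=-1)  # (H, L, L)
    out = (attn @ v).transpose(1, 0, 2).reshape(L, D)
    return out @ p["w_o"]

def decode_motion_mask(geo, flow, pose, p, num_heads=4):
    """Single feed-forward pass: fuse, attend across all frames, predict per-token motion."""
    T, N, _ = geo.shape
    x = fuse_features(geo, flow, pose, p).reshape(T * N, -1)
    x = x + multi_head_self_attention(x, p, num_heads)  # residual connection
    logits = x @ p["w_head"]                            # (T*N, 1)
    probs = 1.0 / (1.0 + np.exp(-logits))               # sigmoid -> motion probability
    return probs.reshape(T, N)

def mask_iou(pred, gt):
    """Intersection over union of two boolean masks."""
    pred, gt = np.asarray(pred, bool), np.asarray(gt, bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 1.0  # both masks empty: treat as perfect agreement
    return np.logical_and(pred, gt).sum() / union

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    T, N = 2, 16  # 2 frames, 16 patch tokens per frame (illustrative sizes)
    params = init_params(rng, dg=8, df=4, dp=6, d=16)
    probs = decode_motion_mask(
        rng.normal(size=(T, N, 8)), rng.normal(size=(T, N, 4)),
        rng.normal(size=(T, 6)), params,
    )
    print(probs.shape)
```

Because the tokens of all frames are flattened into one sequence before attention, each token can attend to every other frame, which is one simple way to expose temporal context without explicit correspondence estimation. Thresholding `probs` (for example at 0.5) would give a binary mask that `mask_iou` can score against ground truth.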
