Segmenting the motion components of a video: A long-term unsupervised model
Etienne Meunier, Patrick Bouthemy
TL;DR
This work tackles unsupervised, long-term motion segmentation from optical flow. It introduces LT-MS, a transformer-assisted architecture that takes a volume of consecutive flow fields as input and produces multiple, temporally coherent motion masks over an entire video sequence. The method combines a space-time parametric motion model (a 12-parameter quadratic model in space with a cubic B-spline in time) with an ELBO-based training objective comprising a flow reconstruction term and a temporal-consistency term, while handling occlusions. A 3D U-Net encoder paired with a transformer decoder enables long-range interactions, and the approach supports variable sequence lengths without post-processing. Experiments on four VOS benchmarks show competitive binary and multi-segment performance, with strong temporal stability and fast inference at test time, which makes the method suitable for downstream tasks such as tracking and dynamic scene interpretation. The paper also reports ablations and appendix analyses that validate the design choices and training procedure.
Abstract
Human beings have the ability to continuously analyze a video and immediately extract the motion components. We want to adopt this paradigm to provide a coherent and stable motion segmentation over the video sequence. To this end, we propose a novel long-term spatio-temporal model operating in a fully unsupervised way. It takes as input the volume of consecutive optical flow (OF) fields, and delivers a volume of segments of coherent motion over the video. More specifically, we have designed a transformer-based network, and we leverage a mathematically well-founded framework, the Evidence Lower Bound (ELBO), to derive the loss function. The loss function combines a flow reconstruction term and a regularization term enforcing temporal consistency on the segments. The flow reconstruction term involves spatio-temporal parametric motion models that combine, in a novel way, polynomial (quadratic) motion models for the spatial dimensions and B-splines for the time dimension of the video sequence. We report experiments on four VOS benchmarks, demonstrating competitive quantitative results, while performing motion segmentation on a whole sequence in one go. We also highlight through visual results the key contributions on temporal consistency brought by our method.
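
To make the space-time motion model more concrete, the following is a minimal sketch of how a quadratic spatial model could be combined with cubic B-splines in time to evaluate the flow of one segment at a point (x, y, t). It is an illustration under stated assumptions, not the authors' implementation: the ordering of the 12 coefficients, the uniform knot layout, the time parameterization, and all function names (`bspline_basis`, `quadratic_basis`, `flow_at`, `uniform_knots`) are hypothetical.

```python
def bspline_basis(i, k, t, knots):
    """Value at t of the i-th B-spline of degree k (Cox-de Boor recursion)."""
    if k == 0:
        return 1.0 if knots[i] <= t < knots[i + 1] else 0.0
    left = right = 0.0
    d1 = knots[i + k] - knots[i]
    if d1 > 0:
        left = (t - knots[i]) / d1 * bspline_basis(i, k - 1, t, knots)
    d2 = knots[i + k + 1] - knots[i + 1]
    if d2 > 0:
        right = (knots[i + k + 1] - t) / d2 * bspline_basis(i + 1, k - 1, t, knots)
    return left + right


def uniform_knots(n_ctrl, degree=3):
    """Uniform integer knots (assumed layout); the basis sums to 1 for t in [degree, n_ctrl)."""
    return list(range(n_ctrl + degree + 1))


def quadratic_basis(x, y):
    """Six monomials of a quadratic polynomial in (x, y)."""
    return [1.0, x, y, x * x, x * y, y * y]


def flow_at(x, y, t, control_params, knots, degree=3):
    """Flow (u, v) of one segment at (x, y, t).

    control_params: one list of 12 floats per temporal control point,
    assumed ordered as 6 coefficients for u followed by 6 for v.
    """
    # Blend the control parameters in time with the B-spline weights.
    theta = [0.0] * 12
    for l, params in enumerate(control_params):
        w = bspline_basis(l, degree, t, knots)
        for j in range(12):
            theta[j] += w * params[j]
    # Apply the resulting 12-parameter quadratic model in space.
    phi = quadratic_basis(x, y)
    u = sum(a * p for a, p in zip(theta[:6], phi))
    v = sum(a * p for a, p in zip(theta[6:], phi))
    return u, v


# Usage: five control points with identical parameters give a motion that is constant in time.
theta0 = [1.0, 0, 0, 0, 0, 0, 0, 2.0, 0, 0, 0, 0]  # u = 1, v = 2x
ctrl = [theta0] * 5
knots = uniform_knots(len(ctrl))
u, v = flow_at(0.5, 0.25, 4.5, ctrl, knots)
```

In this sketch the temporal smoothness of the motion comes entirely from the B-spline blending: each control point influences only a few neighbouring time steps, so the model can follow motion changes over a long sequence while staying smooth.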
