Segmenting the motion components of a video: A long-term unsupervised model

Etienne Meunier, Patrick Bouthemy

TL;DR

This work tackles unsupervised, long-term motion segmentation from optical flow by introducing LT-MS, a transformer-assisted architecture that operates on flow-volume inputs to produce multiple, temporally coherent motion masks across entire video sequences. The method blends a space-time parametric motion model—a 12-parameter quadratic model in space with a cubic B-spline in time—with an ELBO-based training objective that includes a flow reconstruction term and a temporal-consistency term, while handling occlusions. A 3D U-Net encoder paired with a transformer decoder enables long-range interactions, and the approach supports variable sequence lengths without post-processing. Experimental results on four VOS benchmarks show competitive binary and multi-segment performance, with strong temporal stability and fast test-time inference, highlighting its suitability for downstream tasks like tracking and dynamic scene interpretation. The work also provides extensive ablations and appendix analyses to validate design choices and training procedures.

Abstract

Human beings have the ability to continuously analyze a video and immediately extract the motion components. We want to adopt this paradigm to provide a coherent and stable motion segmentation over the video sequence. To this end, we propose a novel long-term spatio-temporal model operating in a fully unsupervised way. It takes as input the volume of consecutive optical flow (OF) fields, and delivers a volume of segments of coherent motion over the video. More specifically, we have designed a transformer-based network, and we leverage a mathematically well-founded framework, the Evidence Lower Bound (ELBO), to derive the loss function. The loss function combines two terms: a flow reconstruction term built on spatio-temporal parametric motion models, which couple, in a novel way, polynomial (quadratic) motion models for the spatial dimensions with B-splines for the time dimension of the video sequence; and a regularization term enforcing temporal consistency on the segments. We report experiments on four VOS benchmarks, demonstrating competitive quantitative results, while performing motion segmentation on a whole sequence in one go. We also highlight through visual results the key contributions on temporal consistency brought by our method.
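To make the space-time motion model more concrete, here is a minimal sketch of how a flow vector could be evaluated from a quadratic spatial model (6 coefficients per flow component, 12 in total) whose coefficients vary in time through a cubic B-spline. The uniform knot spacing, the coefficient ordering, and the function names (`quadratic_basis`, `uniform_cubic_bspline_weights`, `flow_at`) are assumptions made for illustration; they are not taken from the paper and may differ from the authors' implementation.

```python
def quadratic_basis(x, y):
    """Monomials of a 2D quadratic polynomial: 6 terms per flow component."""
    return [1.0, x, y, x * x, x * y, y * y]


def uniform_cubic_bspline_weights(u):
    """Weights of the 4 control points active at local coordinate u in [0, 1].

    These are the standard uniform cubic B-spline blending functions;
    they are non-negative and sum to 1 for any u in [0, 1].
    """
    return [
        (1 - u) ** 3 / 6.0,
        (3 * u**3 - 6 * u**2 + 4) / 6.0,
        (-3 * u**3 + 3 * u**2 + 3 * u + 1) / 6.0,
        u**3 / 6.0,
    ]


def flow_at(control_params, x, y, t, knot_spacing):
    """Evaluate the parametric flow (horizontal, vertical) at point (x, y), time t.

    control_params: list of at least 4 control vectors, each holding 12 floats
        (assumed layout: 6 coefficients for the horizontal component,
        then 6 for the vertical component).
    t: time, assumed to satisfy 0 <= t <= (len(control_params) - 3) * knot_spacing.
    """
    # Index of the first of the 4 control vectors influencing time t.
    span = min(int(t // knot_spacing), len(control_params) - 4)
    u = t / knot_spacing - span
    weights = uniform_cubic_bspline_weights(u)

    # Blend the control vectors to get the 12 motion parameters at time t.
    theta = [
        sum(w * control_params[span + i][j] for i, w in enumerate(weights))
        for j in range(12)
    ]

    # Apply the quadratic spatial model to each flow component.
    basis = quadratic_basis(x, y)
    flow_u = sum(a * b for a, b in zip(theta[:6], basis))
    flow_v = sum(a * b for a, b in zip(theta[6:], basis))
    return flow_u, flow_v
```

With identical control vectors the blended parameters are constant in time, so the model reduces to a single quadratic flow field; letting the control vectors differ makes the same spatial model evolve smoothly across frames.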

Paper Structure (29 sections, 18 equations, 12 figures, 8 tables)

This paper contains 29 sections, 18 equations, 12 figures, 8 tables.

Figures (12)

  • Figure 1: Overall architecture of our multiple motion segmentation method ensuring temporal consistency with the loss term $\mathcal{L}_c$ and the B-spline space-time motion models $\theta_k$ (for $k=1,\dots,K$). It takes as input a volume of $T$ flow fields. It comprises a 3D U-Net ($e$ and $d$ boxes) and a transformer decoder ($t$ box). It also involves positional encoding. A cross-attention product yields the $K$ segmentation masks corresponding to the input volume. For the sake of clarity, the block diagram is represented for three motion segments ($K=3$). $\mathcal{L}_r$ is the flow-reconstruction loss term.
  • Figure 2: Illustration of the spatiotemporal spline-based motion model. Top group: respectively, input flows displayed with the HSV code for the swing video of DAVIS2016, binary segmentation ground truth, flows generated by the estimated spline-based motion models for the two segments. Bottom row: plot of the temporal evolution of the six estimated motion model parameters corresponding to the horizontal flow component for the foreground moving object. The $x$-axis is the video frame index.
  • Figure 3: Three groups of qualitative results regarding the ablation of the temporal-consistency loss term ($K=4$). They respectively correspond to the worm video of SegTrackV2, the dogs01 video of FBMS59, and the car-roundabout video of DAVIS2016. For each group, the first row contains sample images with the segmentation ground-truth overlaid in yellow, the second row displays the input flows, and the third and fourth rows show the predicted motion segmentations, respectively without and with the temporal-consistency loss term. Clearly, this model component allows us to get far more consistent segments over time.
  • Figure 4: Results obtained with our LT-MS-K4 method for motocross-jump from DAVIS2016, people02 from FBMS59 and bmx from SegTrackV2. For each group, the first row shows sample flow fields (HSV color code) from the processed video. The second row contains the corresponding images of the video, where the ground-truth of the moving object is overlaid in yellow. The third row shows the motion segments provided by our LT-MS-K4 method with one color per segment. We consistently use the same color set for the three masks corresponding to moving objects (blue, red and orange), and we leave the original image visible for the background mask.
  • Figure 5: Results obtained with our LT-MS-K4 method ($K=4$). Four groups of results are displayed: monkey and hummingbird from SegTrackV2, goats01 from FBMS59 and libby from DAVIS2016. For each group, the first row shows successive sample flow fields (HSV color code) from the processed video. The second row contains the corresponding images of the video, where the ground-truth of the moving object is overlaid in yellow (when available at that frame). The third row shows the motion segments provided by our LT-MS-K4 method with one color per segment. For all the results, we use the same color set for the three masks corresponding to the moving objects (blue, red and orange), and we leave the original image visible for the background mask.
  • ...and 7 more figures