Refining Pre-Trained Motion Models

Xinglong Sun; Adam W. Harley; Leonidas J. Guibas

TL;DR

This work demonstrates that self-supervised finetuning of pre-trained motion models can degrade performance on real video, and introduces a two-stage approach to robustly refine such models. By first generating cycle-consistent pseudo-labels from unlabelled video using a frozen pre-trained teacher, and then finetuning with augmentations on these labels, the method yields reliable gains over fully-supervised baselines across optical flow and long-range tracking benchmarks. The approach is validated on RAFT and PIPs across MPI-Sintel, CroHD, Horse30, and Tap-Vid-DAVIS, with ablations showing the necessity of cycle-consistency and careful hyperparameter tuning. The practical impact lies in enabling domain-adaptive refinement of powerful, pre-trained motion models without requiring labeled data, improving performance in real-world surveillance, robotics, and video-analysis tasks.

Abstract

Given the difficulty of manually annotating motion in video, the current best motion estimation methods are trained with synthetic data, and therefore struggle somewhat due to a train/test gap. Self-supervised methods hold the promise of training directly on real video, but typically perform worse. These include methods trained with warp error (i.e., color constancy) combined with smoothness terms, and methods that encourage cycle-consistency in the estimates (i.e., tracking backwards should yield the opposite trajectory as tracking forwards). In this work, we take on the challenge of improving state-of-the-art supervised models with self-supervised training. We find that when the initialization is supervised weights, most existing self-supervision techniques actually make performance worse instead of better, which suggests that the benefit of seeing the new data is overshadowed by the noise in the training signal. Focusing on obtaining a "clean" training signal from real-world unlabelled video, we propose to separate label-making and training into two distinct stages. In the first stage, we use the pre-trained model to estimate motion in a video, and then select the subset of motion estimates which we can verify with cycle-consistency. This produces a sparse but accurate pseudo-labelling of the video. In the second stage, we fine-tune the model to reproduce these outputs, while also applying augmentations on the input. We complement this boot-strapping method with simple techniques that densify and re-balance the pseudo-labels, ensuring that we do not merely train on "easy" tracks. We show that our method yields reliable gains over fully-supervised methods in real videos, for both short-term (flow-based) and long-range (multi-frame) pixel tracking.
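The first-stage filtering and the second-stage training signal described above can be sketched for the two-frame (optical flow) case. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `cycle_consistent_mask` and `masked_l1_loss` are hypothetical, the backward flow is looked up with nearest-neighbour rounding rather than interpolation, and the default threshold `tau=2.5` is borrowed from the CroHD ablation in Figure 5, which may not be the value used for flow.

```python
import numpy as np

def cycle_consistent_mask(fwd_flow, bwd_flow, tau=2.5):
    """Keep pixels whose forward-then-backward motion returns near the start.

    fwd_flow: (H, W, 2) flow from frame A to frame B, as (dx, dy).
    bwd_flow: (H, W, 2) flow from frame B to frame A, as (dx, dy).
    tau: maximum allowed cycle error in pixels (assumed value).
    """
    h, w, _ = fwd_flow.shape
    ys, xs = np.mgrid[0:h, 0:w]
    # Where each pixel of frame A lands in frame B (rounded to a pixel).
    xi = np.rint(xs + fwd_flow[..., 0]).astype(int)
    yi = np.rint(ys + fwd_flow[..., 1]).astype(int)
    in_bounds = (xi >= 0) & (xi < w) & (yi >= 0) & (yi < h)
    # Clip only so the lookup is valid; out-of-bounds pixels are rejected below.
    bwd_at_target = bwd_flow[np.clip(yi, 0, h - 1), np.clip(xi, 0, w - 1)]
    # A consistent pair satisfies fwd(p) + bwd(p + fwd(p)) ~ 0.
    cycle_error = np.linalg.norm(fwd_flow + bwd_at_target, axis=-1)
    return in_bounds & (cycle_error < tau)

def masked_l1_loss(pred_flow, pseudo_flow, mask):
    """L1 loss averaged over pseudo-labelled pixels only."""
    per_pixel = np.abs(pred_flow - pseudo_flow).sum(axis=-1)
    return float((per_pixel * mask).sum() / max(mask.sum(), 1))

# Toy example: uniform rightward motion of one pixel on a 4x5 image.
fwd = np.zeros((4, 5, 2)); fwd[..., 0] = 1.0
bwd = np.zeros((4, 5, 2)); bwd[..., 0] = -1.0
bwd[2, 3] = (5.0, 5.0)  # corrupt one backward estimate

mask = cycle_consistent_mask(fwd, bwd)
# A student prediction (here: the teacher output shifted by 0.5 px in x)
# is scored only where the pseudo-label survived the filter.
loss = masked_l1_loss(fwd + np.array([0.5, 0.0]), fwd, mask)
```

In the toy example the last column is rejected because it moves out of frame, and pixel (row 2, column 2) is rejected because the backward estimate at its target disagrees with the forward one. In the second stage the student would see an augmented input while being supervised by the un-augmented teacher output on the surviving pixels; the augmentation step is omitted here.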

Paper Structure (21 sections, 5 equations, 7 figures, 5 tables)

Introduction
Related Work
Method
Preliminaries: RAFT and PIPs
First Stage: Pseudo-label Generation
Second Stage: Motion Refinement
Experiments
Datasets
Evaluation Metrics
Baselines
Results
Ablation Studies
Discussion and conclusion
Acknowledgments
Supplementary Material
...and 6 more sections

Figures (7)

Figure 1: Refining a pre-trained motion model. We apply a pre-trained motion model on an unlabelled video, yielding a dense set of tracks. We filter this down to a sparse set of cycle-consistent tracks, creating pseudo-labels. We then train on the pseudo-labelled data using augmentations, to improve the model.
Figure 2: Evaluation in Tap-Vid-DAVIS. Change in $\delta$ for each video, compared with the baseline pre-trained PIPs model. Positive values denote improvement (i.e., accuracy increasing).
Figure 3: Evaluation in MPI-Sintel. Percent change in EPE for each video, compared with the pre-trained RAFT (Teed and Deng, 2020) model. Negative values denote improvement (i.e., error decreasing).
Figure 4: Comparison of optical flow visualizations on Sintel produced by the pre-trained RAFT (Teed and Deng, 2020) and by RAFT refined with our method, with differences highlighted in red boxes. Our method refines the flow by cleaning noisy tracks and completing missing objects.
Figure 5: Ablation study in CroHD. We show ATE_VIS and ATE_OCC when selecting pseudo-labelled tracks with different thresholds $\tau$ (a and b) and when fine-tuning for different numbers of iterations $\kappa$ (c and d). Lower is better for all metrics. The best performance is observed at $\tau = 2.5$ and $\kappa = 3000$.
...and 2 more figures
