Rethink Predicting the Optical Flow with the Kinetics Perspective

Yuhao Cheng, Siru Zhang, Yiqiang Yan

TL;DR

This work reframes optical flow estimation from a kinetics perspective to address the high cost of dense correlation volumes and occlusion-induced artifacts. It directly predicts flow from high-level features using a Transformer-based Motion Decoder, paired with a differentiable WarpNet that jointly handles warping and occlusion. A kinetics-guided self-supervised learning strategy leverages unlabeled data through a teacher–student framework based on constant-velocity priors, enabling robust motion understanding without extensive labeling. The approach achieves strong results on Sintel and KITTI benchmarks, especially under occlusion and fast motion, while offering improved efficiency and a public code release to foster adoption. Overall, the paper demonstrates that integrating kinetics insights with self-supervision and feature-centric flow prediction yields competitive performance and practical benefits for real-world optical flow tasks.
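The constant-velocity prior mentioned above can be illustrated with a small sketch. It is not the paper's teacher–student loss, whose exact form is not given in this summary. It only shows the underlying assumption: if a point moves at constant velocity, its displacement over two frame intervals is twice its displacement over one. All function names here are hypothetical.

```python
def extrapolate_constant_velocity(flow_t0_t1, steps=2):
    """Predict the displacement from t0 to t0+steps under constant velocity.

    flow_t0_t1 is a (dx, dy) displacement in pixels between adjacent frames.
    With constant velocity, displacement scales linearly with elapsed frames.
    """
    dx, dy = flow_t0_t1
    return (dx * steps, dy * steps)


def endpoint_error(flow_a, flow_b):
    """Euclidean distance between two flow vectors at a single pixel."""
    return ((flow_a[0] - flow_b[0]) ** 2 + (flow_a[1] - flow_b[1]) ** 2) ** 0.5


def constant_velocity_residual(flow_t0_t1, flow_t0_t2):
    """Distance between a two-interval flow and its constant-velocity target.

    Hypothetical helper: a self-supervised scheme could penalise this residual,
    but how the paper actually uses the prior is an assumption here.
    """
    target = extrapolate_constant_velocity(flow_t0_t1, steps=2)
    return endpoint_error(flow_t0_t2, target)


if __name__ == "__main__":
    # A pixel that moves (1.5, -2.0) per frame and keeps its velocity.
    print(constant_velocity_residual((1.5, -2.0), (3.0, -4.0)))
    # A pixel that accelerates vertically, violating the prior.
    print(constant_velocity_residual((1.0, 0.0), (2.0, 3.0)))
```

A residual of this kind needs no ground-truth flow, which is why such a prior suits unlabeled video. It would be unreliable wherever motion accelerates or the point becomes occluded.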

Abstract

Optical flow estimation is a fundamental task in low-level computer vision: it describes pixel-wise displacement between frames and supports many other tasks. From the apparent viewpoint, optical flow can be treated as the correlation between pixels in consecutive frames, so iteratively refining a correlation volume can achieve outstanding performance. However, doing so makes the method prohibitively expensive to compute. In addition, errors arising in occluded regions of successive frames are amplified by an inaccurate warp operation. These challenges cannot be solved from the apparent view alone, so this paper rethinks optical flow estimation from the kinetics viewpoint. Motivated by this, we propose a method that combines apparent and kinetics information. The proposed method predicts optical flow directly from features extracted from the images instead of building a correlation volume, which improves the efficiency of the whole network. The method also introduces a new differentiable warp operation that accounts for warping and occlusion simultaneously. Moreover, it blends the kinetics feature with the apparent feature through a novel self-supervised loss function. Comprehensive experiments and ablation studies show that this new approach to predicting optical flow achieves performance competitive with state-of-the-art methods, and on some metrics it outperforms correlation-based methods, especially in scenes containing occlusion and fast motion. The code will be made public.
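To make the cost argument concrete, the sketch below estimates the memory of an all-pairs correlation volume and compares it with a single feature map. The 1/8 downsampling factor, 256 channels, and 4-byte floats are assumptions chosen for illustration, not values taken from the paper. 1024 x 436 is the Sintel frame size.

```python
def correlation_volume_bytes(height, width, downsample=8, bytes_per_value=4):
    """Memory for an all-pairs correlation volume between two feature maps.

    Each of the h*w positions in frame 1 is compared with each of the h*w
    positions in frame 2, so the volume holds (h*w)**2 values.
    """
    h, w = height // downsample, width // downsample
    return (h * w) ** 2 * bytes_per_value


def feature_map_bytes(height, width, channels=256, downsample=8, bytes_per_value=4):
    """Memory for one feature map, which grows only linearly with h*w."""
    h, w = height // downsample, width // downsample
    return h * w * channels * bytes_per_value


if __name__ == "__main__":
    # Compare the Sintel resolution with a frame twice as large per side.
    for scale in (1, 2):
        hh, ww = 436 * scale, 1024 * scale
        vol = correlation_volume_bytes(hh, ww)
        feat = feature_map_bytes(hh, ww)
        print(f"{ww}x{hh}: volume {vol / 2**20:.1f} MiB, features {feat / 2**20:.1f} MiB")
```

Doubling each image side multiplies the feature map by about 4 but the correlation volume by about 16, which is the scaling that motivates predicting flow directly from features.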

Paper Structure (23 sections, 11 equations, 9 figures, 12 tables)

This paper contains 23 sections, 11 equations, 9 figures, 12 tables.

Figures (9)

  • Figure 1: Main challenges (best viewed in color). (a) A dense correlation volume represents pixel-wise correspondences between frames. Estimating optical flow from the dense correlation is costly in both memory and computation. (b) Occlusion leads to distinct visual artifacts under the warping operation. The first column shows two consecutive frames. In the second column, the upper image is the optical flow and the lower image is the occlusion map. The third column shows that warping without the occlusion map produces unrealistic visual artifacts (top), while warping with the occlusion map avoids them (bottom). (c) The upper part shows the synthetic data used for training, and the lower part shows the setting in which the trained models are used. The two domains differ severely in texture but share the same kinetic characteristics.
  • Figure 2: Overview of our proposed method. In the first phase, Apparent Information Learning, we train the whole network on labeled data, as previous methods do. In this phase, the network learns to extract the apparent information. WarpNet predicts the occlusion map from the bi-directional flow output by our proposed Motion Decoder, and its training is controlled by the perceptual loss and the occlusion loss. We then obtain the predicted flow $\hat{f}_{<t_0,t_1>}$ and compute the loss against the ground truth. After the Apparent Information Learning phase, we use the kinetics-guided self-supervised loss so that the network learns the motion information.
  • Figure 3: Structure of our Motion Decoder and WarpNet. In the Motion Decoder, $F^{\prime}$ denotes the self-attention result and $F^{\prime\prime}$ denotes the cross-attention result.
  • Figure 4: Qualitative comparison of our method with state-of-the-art methods on the KITTI-2015 test set.
  • Figure 5: EPE on the Sintel training set. We fine-tune our model on additional unlabeled datasets, DAVIS [Caelles_davis_2019] and High Speed Sintel [janai2017slow]. For the same number of iterations, our method reduces the loss more than supervised training does.
  • ...and 4 more figures
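
The occlusion issue described in the Figure 1(b) caption can be reproduced with a toy example. The sketch below is not WarpNet, which is a learned module whose details are not given in this summary. It is a plain nearest-neighbour backward warp with an optional occlusion mask. The function name, the `fill` value, and the clamping at the image border are all assumptions made for illustration.

```python
def backward_warp(frame2, flow, occlusion=None, fill=0.0):
    """Warp frame2 toward frame1 using flow defined on frame1's pixel grid.

    frame2:    H x W nested list of intensities.
    flow:      H x W nested list of (dx, dy) displacements from frame1 to frame2.
    occlusion: optional H x W nested list; a truthy entry means that frame1
               pixel has no visible match in frame2.
    fill:      value written at occluded pixels (an assumption of this sketch).
    """
    height, width = len(frame2), len(frame2[0])
    warped = []
    for y in range(height):
        row = []
        for x in range(width):
            if occlusion is not None and occlusion[y][x]:
                # Do not sample frame2 where no valid correspondence exists.
                row.append(fill)
                continue
            dx, dy = flow[y][x]
            # Nearest-neighbour sampling, clamped to the image border.
            sx = min(max(int(round(x + dx)), 0), width - 1)
            sy = min(max(int(round(y + dy)), 0), height - 1)
            row.append(frame2[sy][sx])
        warped.append(row)
    return warped


if __name__ == "__main__":
    # frame1 was [[1, 2, 3, 4]]; the scene shifts right by one pixel, so the
    # pixel with value 4 leaves the image and 9 enters on the left.
    frame2 = [[9, 1, 2, 3]]
    flow = [[(1, 0), (1, 0), (1, 0), (1, 0)]]
    print(backward_warp(frame2, flow))                  # no occlusion map
    print(backward_warp(frame2, flow, [[0, 0, 0, 1]]))  # with occlusion map
```

Without the mask, the last pixel is sampled from a clamped location and duplicates its neighbour, a small version of the ghosting artifact in the figure. With the mask, that pixel is marked as having no valid source instead.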