SplatFlow: Learning Multi-frame Optical Flow via Splatting

Bo Wang, Yifan Zhang, Jian Li, Yang Yu, Zhenping Sun, Li Liu, Dewen Hu

TL;DR

An efficient MOFE framework named SplatFlow is proposed, which introduces the differentiable splatting transformation to align the previous frame’s motion feature and designs a Final-to-All embedding method to input the aligned motion feature into the current frame’s estimation, thus remodeling the existing two-frame backbones.

Abstract

The occlusion problem remains a crucial challenge in optical flow estimation (OFE). Despite the recent significant progress brought about by deep learning, most existing deep learning OFE methods still struggle to handle occlusions; in particular, those based on two frames cannot correctly handle occlusions because occluded regions have no visual correspondences. However, there is still hope in multi-frame settings, which can potentially mitigate the occlusion issue in OFE. Unfortunately, multi-frame OFE (MOFE) remains underexplored, and the few existing studies are mostly designed specifically for pyramid backbones or else obtain the aligned previous frame's features, such as correlation volume and optical flow, through time-consuming backward flow calculation or a non-differentiable forward warping transformation. This study proposes an efficient MOFE framework named SplatFlow to address these shortcomings. SplatFlow introduces the differentiable splatting transformation to align the previous frame's motion feature and designs a Final-to-All embedding method to input the aligned motion feature into the current frame's estimation, thus remodeling the existing two-frame backbones. The proposed SplatFlow is efficient yet more accurate, as it can handle occlusions properly. Extensive experimental evaluations show that SplatFlow substantially outperforms all published methods on the KITTI2015 and Sintel benchmarks. On the Sintel benchmark in particular, SplatFlow achieves errors of 1.12 (clean pass) and 2.07 (final pass), error reductions of 19.4% and 16.2%, respectively, from the previous best submitted results. The code for SplatFlow is available at https://github.com/wwsource/SplatFlow.
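
As a rough consistency check on the Sintel figures, and assuming each reduction $r$ is measured relative to the previous best error $e_{\text{prev}}$ (an assumption; the abstract does not state the baseline values), the implied previous best errors would be approximately:

$$e_{\text{prev}} = \frac{e_{\text{new}}}{1 - r}: \qquad \frac{1.12}{1 - 0.194} \approx 1.39 \ \text{(clean pass)}, \qquad \frac{2.07}{1 - 0.162} \approx 2.47 \ \text{(final pass)}.$$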

Paper Structure (39 sections, 10 equations, 8 figures, 8 tables)

Figures (8)

  • Figure 1: The overall architecture of the proposed SplatFlow framework, designed for single-resolution iterative backbones such as RAFT. The thick orange and purple arrows show the estimation processes $P_{t-1\rightarrow t}$ and $P_{t\rightarrow t+1}$, respectively. The framework encompasses the encoding, alignment, and embedding of the motion feature. The encoder network encodes the motion feature (green part). A Splatting-based method performs unidirectional, differentiable alignment of the motion feature (blue part). A Final-to-All embedding method feeds the aligned motion feature into frame $t$'s estimation process (red part). The detailed data flow for iteratively updating optical flow in RAFT is summarized as a single-resolution iteration module (SIM) in the purple part.
  • Figure 2: Schematic illustration of the whole generation process of the motion feature $M_{t-1,n}$. $O_{t-1,n-1}$ is frame $t-1$'s optical flow from the $(n-1)$th iteration, and $C_{t-1,n}$ is the correlation feature. $conv@n\times n,l$ denotes a convolution with a kernel size of $n\times n$, $l$ output channels, and a stride of 1. C denotes the concatenation operation.
  • Figure 3: (a) Schematic illustration of sampling. (b) Schematic illustration of splatting. Blue rectangles/circles represent the pixels providing contributions in sampling or splatting. Green circles represent the pixels receiving contributions in sampling. Purple rectangles represent the pixels receiving contributions in splatting. The difference between (a) and (b) is that sampling uses the values of surrounding integer pixels to calculate the value of the sub-pixel, while splatting uses the values of surrounding sub-pixels to calculate the value of the integer pixel. (c) Schematic illustration of the proposed Splatting-based motion feature alignment method. We use frame $t-1$'s optical flow $O_{t-1,n}$ after the $n$th iteration to splat the motion feature $M_{t-1,n}$ into frame $t$'s coordinates with non-normalized contributions. All contributions distributed to the same integer pixel $S_j$ are normalized and summed to obtain the aligned motion feature $A_{t-1,n}$.
  • Figure 4: (a) Schematic diagram of the One-to-One embedding method. (b) Schematic diagram of the Final-to-Final embedding method. $O_{t,n}$ is frame $t$'s estimated low-resolution optical flow. $A_{t-1,n}$ is the $n$th iteration's aligned motion feature.
  • Figure 5: Dataset evaluation results versus the number of iterations at inference time. As single-resolution iterative methods, our method and GMA converge quickly, reaching results comparable to their final ones after only 12 iterations. Our method surpasses GMA's final performance after fewer than five iterations on average.
  • ...and 3 more figures
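
The splatting step described in the Figure 3 caption can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `splat_average`, the `eps` guard, the bilinear weighting, and the flow channel order (x displacement first, then y) are all hypothetical choices made for this example. Each source pixel is pushed along its flow vector to a sub-pixel target, its feature is distributed to the four surrounding integer pixels, and the accumulated contributions at each integer pixel are normalized by the accumulated weights.

```python
import numpy as np

def splat_average(feature, flow, eps=1e-8):
    """Forward-splat `feature` (H, W, C) along `flow` (H, W, 2).

    Assumed convention: flow[..., 0] is the x displacement and
    flow[..., 1] is the y displacement, both in pixels.
    Returns the normalized splatted feature and the accumulated weights.
    """
    h, w, c = feature.shape
    ys, xs = np.mgrid[0:h, 0:w]
    tx = xs + flow[..., 0]            # sub-pixel target x of every source pixel
    ty = ys + flow[..., 1]            # sub-pixel target y of every source pixel
    x0 = np.floor(tx).astype(int)
    y0 = np.floor(ty).astype(int)

    acc = np.zeros((h, w, c), dtype=np.float64)   # weighted feature sums
    wsum = np.zeros((h, w), dtype=np.float64)     # weight sums per target pixel

    # Distribute each source pixel to its four integer neighbours.
    for dy in (0, 1):
        for dx in (0, 1):
            xi = x0 + dx
            yi = y0 + dy
            wgt = (1.0 - np.abs(tx - xi)) * (1.0 - np.abs(ty - yi))
            valid = (xi >= 0) & (xi < w) & (yi >= 0) & (yi < h)
            # np.add.at accumulates correctly when several sources hit one target.
            np.add.at(acc, (yi[valid], xi[valid]),
                      feature[valid] * wgt[valid][:, None])
            np.add.at(wsum, (yi[valid], xi[valid]), wgt[valid])

    # Normalize; pixels that received no contribution stay at zero.
    aligned = acc / np.maximum(wsum, eps)[..., None]
    return aligned, wsum

# Example call: shift a small feature map one pixel to the right.
feat = np.arange(12, dtype=float).reshape(3, 4, 1)
flow = np.zeros((3, 4, 2))
flow[..., 0] = 1.0
aligned, weights = splat_average(feat, flow)
```

Target pixels that receive no contribution (holes) keep a zero feature and a zero weight, so the returned weight map can be used to tell them apart. A trainable version would express the same accumulation with a deep learning framework's scatter-add operation so that gradients flow through the features and weights.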