ScaleFlow++: Robust and Accurate Estimation of 3D Motion from Video

Han Ling, Quansen Sun

TL;DR

This paper proposes ScaleFlow++, a 3D motion perception method that generalizes easily. Its key insight is cross-scale matching, which extracts motion-in-depth cues by matching objects in pairs of images at different scales.

Abstract

Perceiving and understanding 3D motion is a core technology in fields such as autonomous driving, robotics, and motion prediction. This paper proposes ScaleFlow++, a 3D motion perception method that generalizes easily. Given only a pair of RGB images, ScaleFlow++ robustly estimates optical flow and motion-in-depth (MID). Most existing methods regress MID directly from two RGB frames or from optical flow, which yields inaccurate and unstable results. Our key insight is cross-scale matching, which extracts motion-in-depth cues by matching objects in pairs of images at different scales. Unlike previous methods, ScaleFlow++ integrates optical flow and MID estimation into a unified architecture and estimates both end-to-end based on feature matching. We also propose modules such as a global initialization network, a global iterative optimizer, and a hybrid training pipeline to integrate global motion information, reduce the number of iterations, and prevent overfitting during training. On KITTI, ScaleFlow++ achieves the best monocular scene flow estimation performance, reducing SF-all from 6.21 to 5.79. Its MID estimates even surpass those of RGBD-based methods. In addition, ScaleFlow++ achieves strong zero-shot generalization in both rigid and nonrigid scenes. Code is available at https://github.com/HanLingsgjk/CSCV.

Paper Structure (29 sections, 26 equations, 12 figures, 7 tables)

Figures (12)

  • Figure 1: Zero-shot Generalization in Real Scenes. Given only two consecutive frames as input, ScaleFlow++ can robustly estimate dense 3D motion fields. The lower left corner of each image shows the color-coded optical flow map, and the lower right corner shows the motion-in-depth (MID). MID provides additional motion cues along the Z axis, such as the dancer's left leg moving away from the camera and the right leg moving closer to the camera.
  • Figure 2: Cross-scale matching idea. We match cars between two consecutive frames $I_1$ and $I_2$, where the yellow car is moving closer to the camera and the red car is static relative to the camera. (a) The correlation volume (CV) module in the optical flow baseline [teed_raft_2020] matches cars at the same scale, so the yellow car cannot be matched well because of its scale change. (b) CSCV matches objects in the 3D scale space, so that each object can be matched in position and scale simultaneously. (c) We visualize the correlation heat map sampled from the CV module. $corr(I_1,I_2^\beta)$ means that the CV is built from $I_1$ and $I_2^\beta$, where the value at each pixel $(x, y)$ is the correlation between the point $I_1(x, y)$ and its corresponding point $I_2^\beta (\beta(x+\bm{f}(x,y)), \beta(y+\bm{f}(x,y)))$, and $\bm{f}$ is the ground truth optical flow. The correlation of the scale-changed yellow car in $corr(I_1,I_2)$ is lower than that of the scale-invariant red car, while the correlation of the yellow car in $corr(I_1,I_2^\beta)$ is higher, which shows that scale change has an important impact on the stability of optical flow matching. (d) The dense scale change field $f_3$ estimated by our CSCV; the value at each pixel is the scale change ratio.
  • Figure 3: Overview of ScaleFlow++. The network is divided into two main stages: initialization and iterative optimization. In the initialization stage, we construct a 4D correlation volume from the 1/16-resolution features extracted by ResNet and sample it using an all-zero optical flow field. From these samples, we regress the initial 3D motion field. In the iterative optimization stage, we sample cross-scale correlation features from CSCV based on the current motion field, encode them, and send them to the GIR module for optimization. The final result is output after N iterations.
  • Figure 4: The complete structure of the cross-scale correlation volume (CSCV). The CSCV module is composed of three parts: the multi-scale correlation volume $C_m$, the optical flow lookup operator $L_c^{of}$, and the scale lookup operator $L_c^{s}$. Given the optical flow field $(f_1,f_2)$ and the scale change field $f_3$ as input, the $L_c^{of}$ operator first samples the optical flow feature $F_{of}$ and the multi-scale optical flow feature $F_{multi}$ from $C_m$, and the $L_c^{s}$ operator then obtains the scale feature $F_{scale}$ from $F_{multi}$.
  • Figure 5: Global Iterative Refinement Module (GIR). Similar to U-Net, GIR is mainly composed of a funnel-shaped encoder and decoder. Specifically, GIR uses ConvNeXt V2, with its large-kernel convolutions, at multiple scales, so that a single convolution kernel can cover almost the entire frame. Combined with the pyramid pooling module at the smallest scale, this greatly enhances global perception capability.
  • ...and 7 more figures
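
The cross-scale matching idea in the Figure 2 caption can be illustrated with a toy sketch. It is not the authors' CSCV: normalized cross-correlation (NCC) of raw intensity patches stands in for learned feature correlation, and the texture, `SCALE`, `CENTER`, and all function names are hypothetical choices made for illustration. The second frame is the first one magnified about a fixed point, as if the object had moved closer to the camera. A patch around a point in frame 1 is compared with a patch around its ground-truth correspondence in frame 2, first at the same scale and then after rescaling frame 2 by a factor β.

```python
import numpy as np

SCALE = 1.5            # hypothetical magnification of the object in frame 2
CENTER = (32.0, 32.0)  # hypothetical center of expansion


def texture(u, v):
    """Hypothetical smooth surface texture on continuous coordinates."""
    return np.sin(0.9 * u) + np.cos(0.7 * v) + 0.5 * np.sin(0.4 * (u + v))


def frame1(x, y):
    """First frame: the texture as seen before the motion."""
    return texture(x, y)


def frame2(x, y):
    """Second frame: frame 1 magnified by SCALE about CENTER."""
    return texture(CENTER[0] + (x - CENTER[0]) / SCALE,
                   CENTER[1] + (y - CENTER[1]) / SCALE)


def patch(image_fn, center, beta=1.0, radius=4):
    """Sample a square patch of the image rescaled by beta.

    With I^beta(q) = I(q / beta), sampling I^beta at beta * center + d
    reads the original image at center + d / beta.
    """
    d = np.arange(-radius, radius + 1, dtype=float)
    dx, dy = np.meshgrid(d, d)
    return image_fn(center[0] + dx / beta, center[1] + dy / beta)


def ncc(a, b):
    """Normalized cross-correlation of two equally sized patches."""
    a = a - a.mean()
    b = b - b.mean()
    return float((a * b).sum() / (np.linalg.norm(a) * np.linalg.norm(b)))


def compare_scales(p=(40.0, 28.0)):
    """Return (same-scale NCC, cross-scale NCC) for point p in frame 1."""
    # Ground-truth correspondence of p in frame 2 (p plus the true flow).
    p2 = (CENTER[0] + SCALE * (p[0] - CENTER[0]),
          CENTER[1] + SCALE * (p[1] - CENTER[1]))
    ref = patch(frame1, p)
    same = ncc(ref, patch(frame2, p2, beta=1.0))
    cross = ncc(ref, patch(frame2, p2, beta=1.0 / SCALE))
    return same, cross


same_score, cross_score = compare_scales()
print(f"same-scale NCC:  {same_score:.3f}")
print(f"cross-scale NCC: {cross_score:.3f}")
```

In this construction, choosing β as the inverse of the magnification makes the frame 2 patch cover the same surface region as the frame 1 patch, so the cross-scale score approaches 1 while the same-scale score stays lower even at the correct position. In the toy setting, the β that maximizes the correlation therefore reveals the scale change ratio.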