Self-Assessed Generation: Trustworthy Label Generation for Optical Flow and Stereo Matching in Real-world

Han Ling, Yinghui Sun, Quansen Sun, Ivor Tsang, Yuhui Zheng

TL;DR

Because SAG can train state-of-the-art deep networks directly on its generated data, in a self-supervised manner, it can greatly improve the generalization performance of self-supervised methods on current mainstream optical flow and stereo-matching datasets.

Abstract

A significant challenge for current optical flow and stereo methods is generalizing well to the real world. This is mainly due to the high cost of producing datasets and to the limitations of existing self-supervised methods, which yield fuzzy results and involve complex model training. To address these challenges, we propose a unified self-supervised generalization framework for optical flow and stereo tasks: Self-Assessed Generation (SAG). Unlike previous self-supervised methods, SAG is data-driven: it uses advanced reconstruction techniques to construct a reconstruction field from RGB images and generates datasets from that field. We then quantify the confidence of the generated results from multiple perspectives, such as reconstruction field distribution, geometric consistency, and structural similarity, to eliminate the inevitable defects of the generation process. We also design an automatic 3D flight foreground rendering pipeline in SAG to encourage the network to learn occlusion and moving foregrounds. Because SAG does not change methods or loss functions, it can be used directly for self-supervised training of state-of-the-art deep networks; in experiments, this greatly improves the generalization performance of self-supervised methods on current mainstream optical flow and stereo-matching datasets. Compared to previous training modes, SAG is more generalized, cost-effective, and accurate.

Paper Structure

This paper contains 26 sections, 34 equations, 15 figures, 7 tables.

Figures (15)

  • Figure 1: The Idea of Data-driven Self-supervised Training. We integrate optical flow and stereo matching into a unified framework (SAG) for self-supervised training. SAG is data-driven at its core: it extracts 3D structures from readily available monocular camera images or videos, filters out abnormal parts, and generates customized datasets for 3D tasks. Unlike previous methods, SAG does not use a self-supervised loss; it trains the model directly on the generated data, achieving generalization performance and generality beyond loss-driven self-supervised schemes.
  • Figure 2: Zero-shot Generalization of SAG in the Real World. SAG achieves strong zero-shot generalization in the real world using only user-collected monocular RGB images. Moreover, unlike previous loss-driven self-supervised methods, SAG can be applied to most existing methods.
  • Figure 3: SAG Pipeline. First, 3D scenes are reconstructed from user-collected images using 3DGS/NeRF and rendered to obtain RGB image pairs, depth maps, reconstruction confidence (RC), and occlusion. Next, the label calculation module computes the task-specific labels from the rendering results and passes them to the defect detection module, which removes the defective parts. Finally, a 3D flight foreground is overlaid on the generated labels to compensate for the lack of foreground in the generated dataset.
  • Figure 4: Median Depth vs. Mean Depth. Left: Rendered RGB image and two different depths of 3DGS and NeRF. Right: The weights $w$ on three different rays (based on NeRF): well-trained ray B, ray C with incorrect surfaces, and ray A with potential multiple surfaces. The mean depth on rays A and C is clearly incorrect because the weighted depth of the wrong surface interferes with the final result, especially when there are incorrect weights at the far end of the ray. The median depth reduces the interference of these erroneous surfaces.
  • Figure 5: Optical Flow Calculation and Occlusion. The optical flow calculation is divided into two steps. First, point $p_1$ in the pixel plane of the first frame is projected onto point $p_1^{w}$ in the world coordinate system. Then, based on the pose $\bm{P}_2$, the point $p_1^{w}$ is projected onto the pixel plane of the second frame to obtain point $p_{1'}$. The figure also shows an occlusion case, with a solid surface between $p_1^{w}$ and the camera's optical center $O_2$. In this case, we can evaluate whether ray $\bm{r_{1'}}$ is occluded by calculating the integral of the weight $w$ between $p_1^{w}$ and $O_2$.
  • ...and 10 more figures
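
The contrast described in the Figure 4 caption can be illustrated with a minimal sketch. It assumes a common convention: the mean depth is the weight-averaged sample depth along a ray, and the median depth is the depth of the first sample at which the normalized cumulative weight reaches 0.5. The paper's exact definitions may differ, and the function names and sample values below are hypothetical.

```python
from bisect import bisect_left
from itertools import accumulate


def mean_depth(depths, weights):
    """Weight-averaged depth along one ray (assumed convention)."""
    total = sum(weights)
    return sum(d * w for d, w in zip(depths, weights)) / total


def median_depth(depths, weights):
    """Depth of the first sample whose normalized cumulative weight reaches 0.5."""
    total = sum(weights)
    cumulative = list(accumulate(w / total for w in weights))
    idx = bisect_left(cumulative, 0.5)
    # Clamp in case rounding keeps the last cumulative value just below 0.5.
    return depths[min(idx, len(depths) - 1)]


# Hypothetical ray: the true surface lies near depth 2.0, and a spurious
# weight sits at the far end of the ray (depth 9.0).
depths = [1.0, 2.0, 3.0, 9.0]
weights = [0.05, 0.70, 0.05, 0.20]

print("mean depth:  ", mean_depth(depths, weights))
print("median depth:", median_depth(depths, weights))
```

In this toy ray the far-end weight pulls the mean depth well behind the dominant surface, while the median depth stays on the sample that carries most of the weight.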
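
The two-step flow calculation and the occlusion check in the Figure 5 caption can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: it assumes a pinhole camera with intrinsics `(fx, fy, cx, cy)`, poses given as camera-to-world `(rotation, translation)` pairs, and a per-pixel rendered depth. All function names, and the occlusion threshold of 0.5, are hypothetical.

```python
def mat_vec(m, v):
    """Multiply a 3x3 matrix by a 3-vector."""
    return [sum(m[i][j] * v[j] for j in range(3)) for i in range(3)]


def transpose(m):
    return [list(row) for row in zip(*m)]


def backproject(pixel, depth, intrinsics, cam_to_world):
    """Step 1: lift pixel p_1 (with its depth) to the world point p_1^w."""
    fx, fy, cx, cy = intrinsics
    u, v = pixel
    p_cam = [(u - cx) / fx * depth, (v - cy) / fy * depth, depth]
    rotation, translation = cam_to_world
    rotated = mat_vec(rotation, p_cam)
    return [rotated[i] + translation[i] for i in range(3)]


def project(point_world, intrinsics, cam_to_world):
    """Step 2: project a world point into the pixel plane of a camera."""
    fx, fy, cx, cy = intrinsics
    rotation, translation = cam_to_world
    shifted = [point_world[i] - translation[i] for i in range(3)]
    x, y, z = mat_vec(transpose(rotation), shifted)  # inverse of a rotation
    if z <= 0:
        raise ValueError("point is behind the second camera")
    return (fx * x / z + cx, fy * y / z + cy)


def flow_at(pixel, depth, intrinsics, pose1, pose2):
    """Optical flow at p_1: the displacement p_1' - p_1."""
    p_world = backproject(pixel, depth, intrinsics, pose1)
    u2, v2 = project(p_world, intrinsics, pose2)
    return (u2 - pixel[0], v2 - pixel[1])


def is_occluded(weights_between, threshold=0.5):
    """Discrete stand-in for integrating w between O_2 and p_1^w.

    `threshold` is a hypothetical value chosen for illustration.
    """
    return sum(weights_between) > threshold


identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
intrinsics = (100.0, 100.0, 50.0, 50.0)
pose1 = (identity, [0.0, 0.0, 0.0])
pose2 = (identity, [0.1, 0.0, 0.0])  # second camera shifted along +x

print(flow_at((50.0, 50.0), 2.0, intrinsics, pose1, pose2))
```

With a purely horizontal camera shift, the flow has only a horizontal component whose magnitude matches the usual stereo disparity relation (focal length times baseline divided by depth), which illustrates how the same rendered depth can supply labels for both tasks.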