Trustworthy Self-Attention: Enabling the Network to Focus Only on the Most Relevant References

Yu Jing, Tan Yujuan, Ren Ao, Liu Duo

TL;DR

This work makes full use of online occlusion recognition information to construct occlusion extended visual features and two strong constraints, allowing the network to learn to focus only on the most relevant references without requiring occlusion ground truth to participate in the training of the network.

Abstract

Predicting optical flow for occluded points remains an unsolved problem. Recent methods use self-attention to find relevant non-occluded points as references for estimating the optical flow of occluded points, based on the assumption of self-similarity. However, they rely on the visual features of a single image and on weak constraints, which are not sufficient to keep the trained network from focusing on erroneous and weakly relevant reference points. We make full use of online occlusion recognition information to construct occlusion extended visual features and two strong constraints, allowing the network to learn to focus only on the most relevant references without requiring occlusion ground truth during training. Our method adds very few network parameters to the original framework, making it very lightweight. Extensive experiments show that our model has the best cross-dataset generalization. Relative to the state-of-the-art GMA-based method, MATCHFlow(GMA), our method reduces error by 18.6%, 16.2%, and 20.1% for all points, non-occluded points, and occluded points, respectively, on the Sintel Albedo pass. Furthermore, our model achieves state-of-the-art performance on the Sintel benchmarks, ranking #1 among all published methods on the Sintel clean pass. The code will be open-sourced.
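
To make the idea of attending over occlusion extended features concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the function names, the toy features, and the way the occlusion channel is attached are all assumptions made for illustration. In particular, the hand-set query and key extensions below stand in for whatever learned projections the real network uses, and the two training constraints are not modeled.

```python
import math

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys):
    """Row-wise softmax over scaled dot products between queries and keys."""
    scale = 1.0 / math.sqrt(len(keys[0]))
    return [
        softmax([scale * sum(a * b for a, b in zip(q, k)) for k in keys])
        for q in queries
    ]

def extend_with_occlusion(features, occluded, weight=4.0):
    """Hypothetical occlusion extension (an assumption, not the paper's design).

    Each key gets one extra channel: +weight if the point is non-occluded,
    -weight if it is occluded. Each query gets a constant 1.0 in that channel,
    so every score is shifted by +/- weight depending on the key's status.
    """
    queries = [f + [1.0] for f in features]
    keys = [f + [(-weight if occ else weight)] for f, occ in zip(features, occluded)]
    return queries, keys

# Toy data: points 0-2 look alike (same surface), point 3 looks different.
features = [[1.0, 0.0], [1.0, 0.1], [0.9, 0.2], [0.0, 1.0]]
occluded = [False, True, True, False]  # assumed output of an occlusion detector

plain = attention(features, features)            # single-image features only
ext_q, ext_k = extend_with_occlusion(features, occluded)
extended = attention(ext_q, ext_k)               # occlusion-aware scores

# Attention mass that occluded query 1 places on occluded references (1 and 2).
plain_mass = plain[1][1] + plain[1][2]
extended_mass = extended[1][1] + extended[1][2]
```

In this toy setup, the occluded query spreads much of its attention over other occluded points when only appearance features are used, while the extra channel moves nearly all of that mass onto non-occluded references.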

Paper Structure (16 sections, 1 equation, 5 figures, 8 tables)

Figures (5)

  • Figure 2: Comparison of three methods for training and inference of self-attention. First, both GMA and GMFlow apply self-attention only to the visual features of image0 to obtain a 4D attention matrix. Because the image0 features carry insufficient occlusion information, self-attention cannot reliably identify occluded points and exclude them from its focus, so the reference information it yields is erroneous (Untrustworthy). Our method addresses this by incorporating occlusion information into the self-attention to create occlusion extended features (Trustworthy). Second, GMA trains with weak indirect constraints, while GMFlow uses weak direct constraints, which gives GMFlow better attention performance. Attention scattering remains, however: non-occluded points do not attend solely to themselves, which leads to inaccurate optical flow estimates on the surfaces of common objects that have different depths of field or are non-rigid (Weakly relevant). Our method resolves attention scattering by introducing two strong direct constraints, ensuring that non-occluded points attend only to themselves (The most relevant). The performance of these methods is illustrated in Figure 1.
  • Figure 3: The overall framework of our method. Our method consists of four stages that are executed sequentially. The first stage takes image pairs as input and extracts local image features $f0_q$ and $f1_q$ at 4x downsampling, as well as global image features F0 and F1 at 8x downsampling. It also computes $flow_{GM}$. The second stage takes the global image features F0 and F1 and $flow_{GM}$ as input to calculate the occlusion information. The third stage uses F0, the occlusion information, and $flow_{GM}$ as input to obtain the trustworthy attention matrix and the rectified flow. The fourth stage uses F0, upsampled by 2x and convolved, as the Context feature, and the enhanced rectified flow as the initial flow for the Refinement process. It also uses $f0_q$ and $f1_q$ to calculate the correlation volume. We replace the attention matrix in GMA with our trustworthy attention matrix. After sufficient Refinement iterations, the current refined flow is used as the final output.
  • Figure 4: The mean of each metric for each image on the Sintel Clean pass. "Noc" means non-occluded and "Occ" means occluded. In (a) and (c), smaller values are better. In (b), larger values are better. Our method shows a clear advantage over the other two methods.
  • Figure 5: The performance of our method depends on the performance of the occlusion information detector. When the input images are of poor quality, for example with significant motion blur, and comprehensive training data is lacking, the detector struggles to recognize occlusions accurately, leading to substantial optical flow prediction errors. Increasing the amount of training data improves occlusion recognition, which in turn reduces the optical flow prediction errors.
  • Figure 6: The mean reference distance of all points on the image. After adding the Strong Repulsion constraint, the non-occluded points have largely learned to attend only to themselves, indicating that the network has learned a good occlusion-aware feature representation. After adding the Strong Attraction constraint, almost all non-occluded points visibly reference only themselves. Each “+” indicates a component stacked on the previous method.
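
The Figure 6 caption does not define the mean reference distance, so the following is only one plausible reading, sketched in plain Python: for each query pixel, the attention-weighted average Euclidean distance (in pixels) to the reference pixels it attends to. The function name and the row-major pixel layout are assumptions for illustration. Under this reading, a point that attends only to itself has a distance of zero.

```python
import math

def mean_reference_distance(attention, height, width):
    """Attention-weighted mean distance from each query pixel to its references.

    attention[i][j] is the weight query pixel i places on reference pixel j,
    with pixels indexed in row-major order and each row summing to 1.
    Returns one distance per query pixel.
    """
    n = height * width
    coords = [(i // width, i % width) for i in range(n)]  # (row, col) per pixel
    distances = []
    for i, (qy, qx) in enumerate(coords):
        weighted = sum(
            attention[i][j] * math.hypot(qy - ky, qx - kx)
            for j, (ky, kx) in enumerate(coords)
        )
        distances.append(weighted)
    return distances

# Toy 1x3 image: compare self-only attention with fully scattered attention.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
uniform = [[1.0 / 3.0] * 3 for _ in range(3)]

self_only = mean_reference_distance(identity, height=1, width=3)
scattered = mean_reference_distance(uniform, height=1, width=3)
```

Here `self_only` is zero everywhere, while `scattered` is larger at every pixel, which matches the caption's use of a small reference distance as evidence that non-occluded points attend to themselves.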