FAFA: Frequency-Aware Flow-Aided Self-Supervision for Underwater Object Pose Estimation

Jingyi Tang, Gu Wang, Zeyu Chen, Shengquan Li, Xiu Li, Xiangyang Ji

TL;DR

This work tackles underwater 6D pose estimation using RGB data only, addressing annotation scarcity and the sim2real domain gap. It introduces FAFA, a two-stage framework: frequency-aware augmentation injects target-domain style into synthetic data, and flow-aided self-supervision then adapts the network end to end, using a teacher–student architecture with pseudo-flow labels. Key contributions include amplitude mix/dropout in the Fourier domain and a multi-level alignment strategy that couples image-level constraints with feature-level similarity and a shape-constrained optical flow for pose refinement, expressed through the flows $f^{s \rightarrow t}$, $f^{tea}$, and $f^{stu}$. On the ROV6D and DeepURL benchmarks, FAFA achieves state-of-the-art performance without real pose annotations, which suggests practical potential for underwater robotic perception. Limitations are discussed in the Appendix.

Abstract

Although methods for estimating the pose of objects in indoor scenes have achieved great success, the pose estimation of underwater objects remains challenging due to difficulties brought by the complex underwater environment, such as degraded illumination, blurring, and the substantial cost of obtaining real annotations. In response, we introduce FAFA, a Frequency-Aware Flow-Aided self-supervised framework for 6D pose estimation of unmanned underwater vehicles (UUVs). Essentially, we first train a frequency-aware flow-based pose estimator on synthetic data, where an FFT-based augmentation approach is proposed to facilitate the network in capturing domain-invariant features and target domain styles from a frequency perspective. Further, we perform self-supervised training by enforcing flow-aided multi-level consistencies to adapt it to the real-world underwater environment. Our framework relies solely on the 3D model and RGB images, alleviating the need for any real pose annotations or other-modality data like depths. We evaluate the effectiveness of FAFA on common underwater object pose benchmarks and showcase significant performance improvements compared to state-of-the-art methods. Code is available at github.com/tjy0703/FAFA.

Paper Structure (13 sections, 9 equations, 4 figures, 4 tables)


Figures (4)

  • Figure 1: Abstract illustration of our proposed approach. We initially train the network using annotated synthetic RGB data. Subsequently, real-world unlabeled data are employed for self-supervised learning to further refine the network, after which its performance improves significantly. The green and red bounding boxes denote the ground truth and prediction, respectively.
  • Figure 2: Two-stage self-supervised framework for underwater object pose estimation. Top: We introduce an FFT-based data augmentation strategy, leveraging random real-world images to generate the augmented ones. We initialize the teacher–student framework from the pre-trained network. The real image, along with a set of synthetic images generated based on $\textbf{P}_0$, is input to the self-supervised network. The teacher/student network consists of three components: (1) a feature encoder; (2) a flow regressor (RAFT; Teed and Deng, 2020) that outputs a hidden feature $h$ and further estimates a flow field $f$; (3) a pose regressor that predicts a relative pose $\textbf{P}_{\Delta}$ to generate a shape-constrained flow field. The shape-constrained flow is then fed back to the flow regressor for iterative network optimization. Finally, the refined pose is output. During self-supervision, the results ($f^{stu}$, $\textbf{P}^{stu}$) estimated from the noisy inputs are supervised by pseudo-labels ($f^{tea}$, $\textbf{P}^{tea}$) obtained from clean images. Bottom: We optimize flow and pose estimation by simultaneously applying image-level and feature-level alignment constraints.
  • Figure 3: FFT-based augmentation strategy. Top: The FFT algorithm extracts amplitude and phase components from a synthetic image. Images are then reconstructed by exclusively utilizing either the amplitude or the phase component through the inverse Fourier transform (iFFT). Bottom: Real image amplitude information is introduced, and a new blended amplitude component is formed by mixing it with the synthetic image amplitude. Subsequently, an augmented image is reconstructed by combining this new amplitude component with the synthetic image's phase component.
  • Figure 4: Qualitative results on (a) ROV6D and (b) DeepURL. The results are obtained before (top) and after (bottom) employing our self-supervision, respectively. The green and red wireframes represent the ground-truth pose and the predicted pose, respectively.
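
The amplitude mixing described in the Figure 3 caption can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name `amplitude_mix`, the mixing weight `lam`, and the choice of a simple linear blend of the two amplitude spectra are all assumptions made for illustration. The sketch follows the caption only in its overall shape: take the FFT of both images, blend the amplitudes, keep the synthetic image's phase, and reconstruct with the inverse FFT.

```python
import numpy as np


def amplitude_mix(synthetic, real, lam=0.5):
    """Blend the Fourier amplitude of `real` into `synthetic`, keeping the synthetic phase.

    Both images are float arrays of shape (H, W, C) with values in [0, 1].
    `lam` is a hypothetical mixing weight: 0 keeps the synthetic amplitude,
    1 uses the real amplitude entirely.
    """
    if synthetic.shape != real.shape:
        raise ValueError("images must share the same shape")

    # 2D FFT over the spatial axes, computed independently per channel.
    fft_syn = np.fft.fft2(synthetic, axes=(0, 1))
    fft_real = np.fft.fft2(real, axes=(0, 1))

    # Split the synthetic spectrum into amplitude and phase.
    amp_syn, phase_syn = np.abs(fft_syn), np.angle(fft_syn)
    amp_real = np.abs(fft_real)

    # Assumed linear blend of the two amplitude spectra.
    amp_mix = (1.0 - lam) * amp_syn + lam * amp_real

    # Recombine the blended amplitude with the synthetic phase and invert.
    mixed = np.fft.ifft2(amp_mix * np.exp(1j * phase_syn), axes=(0, 1))
    return np.clip(mixed.real, 0.0, 1.0)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    syn_img = rng.random((64, 64, 3))   # stand-in for a rendered synthetic image
    real_img = rng.random((64, 64, 3))  # stand-in for an unlabeled real image
    augmented = amplitude_mix(syn_img, real_img, lam=0.3)
    print(augmented.shape)
```

Because the phase is left untouched, the object's structure in the synthetic image is preserved while the amplitude carries over some of the real image's appearance statistics. The amplitude dropout mentioned in the TL;DR is not shown here, since this page does not describe how it is applied.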