Towards Real-World Aerial Vision Guidance with Categorical 6D Pose Tracker

Jingtao Sun; Yaonan Wang; Danwei Wang

TL;DR

The paper tackles category-level 6-DoF pose tracking for guiding real-world aerial manipulation. It introduces Robust6DoF, a three-stage tracker that fuses 2D-3D features and combines a Shape-Based Spatial-Temporal Augmentation module with a Prior-Guided Keypoints Generation module to establish robust inter-frame correspondences under severe viewpoint changes. Complementing this, PAD-Servo is a pose-aware, decoupled servo strategy that uses the tracked pose to drive both the onboard manipulator and the UAV. Experiments on four public datasets plus real-world aerial tests report state-of-the-art accuracy, robustness to frame drops and noise, and real-time performance. The work offers an integrated solution for category-level pose tracking and robotic vision guidance in high-maneuverability aerial settings.

Abstract

Tracking an object's 6-DoF pose is crucial for various downstream robot tasks and real-world applications. In this paper, we investigate the real-world robot task of aerial vision guidance for aerial robotics manipulation, utilizing category-level 6-DoF pose tracking. Aerial conditions inevitably introduce special challenges, such as rapid viewpoint changes in pitch and roll and large inter-frame differences. To address these challenges, we first introduce a robust category-level 6-DoF pose tracker (Robust6DoF). This tracker leverages shape and temporal prior knowledge to find optimal inter-frame keypoint pairs, which are generated under prior-based, structure-adaptive supervision in a coarse-to-fine manner. Notably, Robust6DoF employs a Spatial-Temporal Augmentation module to handle inter-frame differences and intra-class shape variations through both temporal dynamic filtering and shape-similarity filtering. We further present a Pose-Aware Discrete Servo strategy (PAD-Servo), which serves as a decoupling approach to implement the final aerial vision guidance task. It contains two servo action policies to better accommodate the structural properties of aerial robotics manipulation. Extensive experiments on four well-known public benchmarks demonstrate the superiority of Robust6DoF. Real-world tests directly verify that Robust6DoF along with PAD-Servo can be readily used in real-world aerial robotic applications.
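To make the guidance objective concrete, the sketch below measures how far a tracked 6-DoF pose is from a desired pose, which is the kind of quantity a pose-based servo drives toward zero. It is a minimal illustration and not the paper's PAD-Servo: the function names `pose_error` and `reached`, the rotation-matrix-plus-translation representation, and the tolerance values are all assumptions made for this example.

```python
import math


def rot_z(angle_rad):
    """3x3 rotation about the z-axis (used only to build example poses)."""
    c, s = math.cos(angle_rad), math.sin(angle_rad)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


def pose_error(R_cur, t_cur, R_des, t_des):
    """Return (rotation error in degrees, translation error in input units).

    The rotation error is the angle of the relative rotation R_des^T R_cur,
    recovered from its trace: angle = arccos((trace - 1) / 2).
    """
    # trace(R_des^T R_cur) equals the element-wise product summed over all entries.
    trace = sum(R_des[i][j] * R_cur[i][j] for i in range(3) for j in range(3))
    # Clamp to [-1, 1] to guard against floating-point round-off before acos.
    cos_angle = max(-1.0, min(1.0, (trace - 1.0) / 2.0))
    rot_err_deg = math.degrees(math.acos(cos_angle))
    trans_err = math.dist(t_cur, t_des)
    return rot_err_deg, trans_err


def reached(R_cur, t_cur, R_des, t_des, rot_tol_deg=5.0, trans_tol=0.05):
    """True when both errors fall below the (hypothetical) tolerances."""
    rot_err, trans_err = pose_error(R_cur, t_cur, R_des, t_des)
    return rot_err <= rot_tol_deg and trans_err <= trans_tol


# Hypothetical example: tracked pose is rotated 30 degrees and offset 0.2 along x.
R_tracked, t_tracked = rot_z(math.radians(30.0)), (0.2, 0.0, 0.5)
R_desired, t_desired = rot_z(0.0), (0.0, 0.0, 0.5)
rot_err, trans_err = pose_error(R_tracked, t_tracked, R_desired, t_desired)
```

A servo loop would re-evaluate these errors every time the tracker outputs a new pose and stop issuing commands once `reached(...)` holds; how the error is split between the UAV and the manipulator is left to the policy itself.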

Paper Structure (42 sections, 37 equations, 17 figures, 10 tables, 1 algorithm)

Introduction
Related Work
Aerial Visual Object Tracking
Object 6-DoF Pose Estimation
Object 6-DoF Pose Tracking
Visual Servoing for Aerial Robotics Manipulation
Preliminary and task statement
Robot Frame and Velocity Transmission
Task Description
Approach
Categorical 6-DoF Pose Tracker: Robust6DoF
Network Overview
2D-3D Dense Fusion Transformer
Shape-Based Spatial-Temporal Augmentation
Prior-Guided Keypoints Generation and Match
...and 27 more sections

Figures (17)

Figure 1: Overview of the pipeline. a) Following the real-time 6-DoF pose tracked by our Robust6DoF, the aerial manipulator gradually guides itself to the desired position, where the target object's pose approaches the desired value. b) The proposed Robust6DoF achieves top performance on the $IoU25$ metric with the best inference speed on the public NOCS-REAL275 dataset. We test competitive category-level tracking-based and tracking-free (single-frame pose estimation) methods using their official checkpoints and code. For fairness, all results are measured on the same device.
Figure 2: Establishment of the aerial robot frames. $\{W: O_W\text{-}X_W Y_W Z_W\}$ denotes the world coordinate frame. $\{B: O_B\text{-}X_B Y_B Z_B\}$ denotes the base coordinate frame of the aerial vehicle. $\{L_i: O_i\text{-}X_i Y_i Z_i\}$ denotes the body frame of the $i$-th link of the robotic manipulator ($i = 0,1,2,3,4$), where $i = 0$ indicates the base frame of the manipulator. $\{T: O_T\text{-}X_T Y_T Z_T\}$ denotes the coordinate frame of the actuator. $\{C: O_C\text{-}X_C Y_C Z_C\}$ denotes the onboard camera frame. The blue dot represents the 3D workspace of the onboard manipulator.
Figure 3: Complete framework of our category-level 6-DoF pose tracker, termed Robust6DoF. It takes the RGB-D video stream captured by the onboard camera as input and tracks the 6-DoF pose ${\cal P}^{(t)}$ of an arbitrary object in the current observation. It consists of three stages. Stage-1: 2D-3D dense fusion aggregates the object's pixel-point local descriptor $\tilde{F}_{obj}^{(t)}$ (shown in Fig. 4 (a)); Stage-2: shape-based spatial-temporal augmentation performs comprehensive refinement to obtain a group of embeddings $\{ \tilde{F}_{obj}^{(t)},\tilde{F}_{temp}^{(t)},\tilde{F}_{aug}^{(t)}\}$, taking advantage of both temporal prior and shape prior knowledge (shown in Fig. 4 (b) and (c)); and Stage-3: prior-guided keypoints generation and matching constructs $n$ inter-frame keypoint pairs $(k_i^{(t-1)},k_j^{(t)})$ and aligns them accurately in a coarse-to-fine manner. Using these optimally matched keypoint pairs, we solve for the object's final 6D pose with the PnP and RANSAC algorithms.
Figure 4: Detailed structure of the tracking workflow in the first two stages. a) 2D-3D Dense Fusion Transformer. The image crop and point patch serve as inputs to generate the fused local descriptor $\tilde{F}_{obj}^{(t)}$ for arbitrary instances in the current $t$-th frame. This component consists of two parts: i) the WSA layer, employed for pixel-point dense fusion; ii) scaled dot-product attention for local feature aggregation. b) Spatial-Temporal Filtering Encoder. It propagates temporal knowledge from the previous $(t-1)$-th frame to the current one via the proposed temporal dynamic filtering. c) Augmentation Decoder with shape-similarity filtering. These blocks leverage the proposed shape-similarity filtering to augment the temporal embedding $\tilde{F}_{temp}^{(t)}$, effectively addressing the challenge of intra-category variability.
Figure 5: Complete flowchart of our proposed PAD-Servo. Given the object's 6-DoF pose ${\cal P}^{(t)}$ estimated by our Robust6DoF at the current $t$-th timestep, we introduce a decomposed policy to achieve comparable and robust aerial guidance for the aerial manipulator.
...and 12 more figures
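The frames listed in the Figure 2 caption are related by rigid transforms, so an object pose observed in the camera frame can be expressed in the world frame by chaining them. The sketch below illustrates that chaining with 4x4 homogeneous matrices. It is a simplified illustration under stated assumptions: the camera is treated as rigidly attached to the vehicle base, rotations are restricted to yaw, and every numeric value and helper name (`make_transform`, `chain`) is hypothetical rather than taken from the paper.

```python
import math


def matmul4(A, B):
    """Multiply two 4x4 matrices given as nested lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def make_transform(yaw_rad, translation):
    """Homogeneous transform with a rotation about z and a translation."""
    c, s = math.cos(yaw_rad), math.sin(yaw_rad)
    x, y, z = translation
    return [[c, -s, 0.0, x],
            [s, c, 0.0, y],
            [0.0, 0.0, 1.0, z],
            [0.0, 0.0, 0.0, 1.0]]


def chain(*transforms):
    """Compose transforms left to right: chain(T_W_B, T_B_C) gives T_W_C."""
    result = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
    for T in transforms:
        result = matmul4(result, T)
    return result


# Hypothetical values, for illustration only.
T_W_B = make_transform(math.pi / 2, (1.0, 2.0, 3.0))   # vehicle base in world frame
T_B_C = make_transform(0.0, (0.1, 0.0, -0.05))         # camera in base frame (assumed fixed mount)
T_C_O = make_transform(0.0, (0.0, 0.0, 0.5))           # object in camera frame (tracker output)

T_W_O = chain(T_W_B, T_B_C, T_C_O)                     # object in world frame
object_position_world = [row[3] for row in T_W_O[:3]]
```

The same `chain` call extends to the manipulator link frames $L_0$ to $L_4$ and the actuator frame $T$ by inserting one transform per joint, which is how a desired pose in the camera frame could be turned into a target for the end effector.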
