SuperPose: Improved 6D Pose Estimation with Robust Tracking and Mask-Free Initialization

Yu Deng; Jiahong Xue; Teng Cao; Yingxing Zhang; Lanxi Wen; Yiyang Chen

Abstract

We developed a robust solution for real-time 6D object detection in industrial applications that integrates FoundationPose, SAM2, and LightGlue and requires no retraining. Our approach addresses two key challenges in FoundationPose: the need for an object mask in the first frame, and tracking loss and automatic rotation for symmetric objects. The algorithm requires only a CAD model of the target object; during initial setup, the user clicks on the object's location in the live feed. The algorithm then automatically saves a reference image of the object and, in subsequent runs, uses LightGlue to match features between that reference image and the real-time scene, which provides the initial prompt for detection. Tested on the YCB dataset and industrial components such as bleach cleanser and gears, the algorithm demonstrated reliable 6D detection and tracking. Integrating SAM2 with FoundationPose mitigates common limitations such as tracking loss, keeping tracking continuous and accurate under challenging conditions such as occlusion or rapid movement.

Paper Structure (22 sections, 7 equations, 8 figures, 3 tables)

Introduction
Related Work
CAD Model-Based Object Pose Estimation
Instance Segmentation
Feature Point Matching
Proposed Method
Integration and Testing of Methods
Initial Target Identification via Manual Selection or Segmented Image Input
Segmentation Prompt Generation Based on User Input or Feature Matching
6D Pose Estimation
Handling Tracking Loss in FoundationPose
Addressing Long-Term Object Loss: A Memory Mechanism
Experiment
Deployment and Evaluation of FoundationPose
CNOS and CNOS+FoundationPose
...and 7 more sections

Figures (8)

Figure 1: System workflow. The image shows the implementation process of the entire system. The object's position in the real image is first obtained either from user clicks or by using LightGlue to match features between the segmented image and the real image, which provides a positional prompt. This prompt is passed to SAM2. If no segmented image of the object exists, one is generated and stored in memory. At the same time, a mask matrix for the object is created. Finally, FoundationPose combines the generated mask, the object's CAD model, and the real frame to perform 6D pose estimation.
Figure 2: Initialization via manual selection or segmented image input. The image illustrates the initialization process of our system. In the first method, the user simply clicks on the frame, which sends a prompt to SAM2. SAM2 generates a mask that is passed to FoundationPose to produce a pose estimate and a segmented image, and the segmented image is stored for the memory mechanism. In the second method, the user provides a segmented image to LightGlue, which generates a prompt for SAM2. SAM2 then produces a mask, which FoundationPose uses to generate the pose estimate.
Figure 3: The process of robust tracking. The image illustrates the process of robust tracking. In the first step, the frame is sent simultaneously to FoundationPose to obtain the pose estimate and to SAM2 to generate a mask. The robust Lorentzian centroid distance is then calculated. If this distance exceeds a predefined threshold, the system re-registers the object to recover from tracking loss.
Figure 4: The process of the memory mechanism. The image depicts the memory mechanism used when an object is lost from the frame for an extended period. Loss is determined by measuring the difference between the area of the maximum contour and the area of the initial contour. If the object remains lost for a specified duration, the segmented image is reloaded into LightGlue, which then passes a prompt to SAM2 to generate a mask for FoundationPose.
Figure 5: The initial reference images generated from CNOS. The results show that CNOS initially generates 42 reference images from the CAD model as prompts, which are then passed to SAM.
...and 3 more figures
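
The two checks described in the captions of Figures 3 and 4 can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the paper's implementation: the captions do not give the exact form of the Lorentzian centroid distance, so the common robust penalty log(1 + d^2 / (2 sigma^2)) is assumed here; the mask pixel count stands in for the contour area; and every name and default value (`needs_reregistration`, `LossMonitor`, `sigma`, `threshold`, `min_area_ratio`, `max_lost_frames`) is hypothetical.

```python
import math
from dataclasses import dataclass


def mask_centroid(mask):
    """Return the (x, y) centroid of the truthy pixels in a 2D mask, or None if empty."""
    sum_x = sum_y = count = 0
    for y, row in enumerate(mask):
        for x, value in enumerate(row):
            if value:
                sum_x += x
                sum_y += y
                count += 1
    return (sum_x / count, sum_y / count) if count else None


def mask_area(mask):
    """Pixel count of the mask (assumed stand-in for the contour area)."""
    return sum(1 for row in mask for value in row if value)


def lorentzian_distance(p, q, sigma=10.0):
    """Assumed Lorentzian penalty on the Euclidean distance between two 2D points."""
    d2 = (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2
    return math.log1p(d2 / (2.0 * sigma ** 2))


def needs_reregistration(pose_center_px, mask, threshold=1.0, sigma=10.0):
    """Figure 3 check: compare the projected pose center with the mask centroid."""
    centroid = mask_centroid(mask)
    if centroid is None:
        # Assumption: an empty mask is treated as a tracking failure.
        return True
    return lorentzian_distance(pose_center_px, centroid, sigma) > threshold


@dataclass
class LossMonitor:
    """Figure 4 check: count consecutive frames in which the object looks lost."""
    initial_area: int
    min_area_ratio: float = 0.2   # hypothetical: below this fraction counts as lost
    max_lost_frames: int = 30     # hypothetical: frames to wait before recovery
    lost_frames: int = 0

    def update(self, mask):
        """Return True once the object has been lost for max_lost_frames in a row."""
        if mask_area(mask) < self.min_area_ratio * self.initial_area:
            self.lost_frames += 1
        else:
            self.lost_frames = 0
        return self.lost_frames >= self.max_lost_frames
```

In such a sketch, `needs_reregistration` would run on every frame with the 2D projection of the estimated pose center and the current mask, while `LossMonitor.update` returning `True` would be the signal to reload the stored segmented image for feature matching.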
