STORM: Segment, Track, and Object Re-Localization from a Single Image

Yu Deng, Teng Cao, Hikaru Shindo, Jiahong Xue, Quentin Delfosse, Kristian Kersting

TL;DR

STORM tackles the deployment bottleneck of 6D pose estimation and tracking with an annotation-free pipeline that uses reference images to generate masks and object-centric 3D models. It combines a segmentation module (SOM), built on Hierarchical Spatial Fusion Attention (HSFA) and SAM3D-based 3D reconstruction, with a tracking module (TOM) that detects failures via a memory-based verifier and re-initializes as needed. The approach achieves state-of-the-art annotation-free performance on challenging benchmarks, runs at real-time speeds, and is robust to occlusion and fast motion. Key innovations include language-guided semantic prompts, HSFA for cross-domain feature fusion, and a lightweight tracking-loss classifier that enables reliable automatic re-registration.

Abstract

Accurate 6D pose estimation and tracking are fundamental capabilities for physical AI systems such as robots. However, existing approaches typically require a pre-defined 3D model of the target and rely on a manually annotated segmentation mask in the first frame. This setup is labor-intensive, and performance degrades under occlusion or rapid movement. To address these limitations, we propose STORM (Segment, Track, and Object Re-localization from a single iMage), an open-source, robust, real-time 6D pose estimation system that requires no manual annotation. STORM employs a novel three-stage pipeline combining vision-language understanding with feature matching: contextual object descriptions guide localization, self-cross-attention mechanisms identify candidate regions, and the final stage produces precise masks and 3D models for accurate pose estimation. Another key innovation is our automatic re-registration mechanism, which detects tracking failures through feature similarity monitoring and recovers from severe occlusions or rapid motion. STORM achieves state-of-the-art accuracy on challenging industrial datasets featuring multi-object occlusions, high-speed motion, and varying illumination, while operating at real-time speeds without additional training. This annotation-free approach significantly reduces deployment overhead, providing a practical solution for modern applications such as flexible manufacturing and intelligent quality control.
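
The failure detection through feature similarity monitoring described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: STORM uses a learned tracking-loss classifier and a memory-based verifier, whereas this example substitutes a plain cosine-similarity threshold. The names `TrackingMonitor`, `threshold`, and `patience`, as well as their default values, are hypothetical and chosen only for illustration.

```python
import math
from dataclasses import dataclass


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


@dataclass
class TrackingMonitor:
    """Hypothetical stand-in for a tracking-failure check (not STORM's classifier)."""
    reference: list          # feature of the object stored at registration time
    threshold: float = 0.5   # assumed similarity cut-off
    patience: int = 3        # assumed number of consecutive low-similarity frames
    _low_count: int = 0      # internal counter of consecutive low-similarity frames

    def update(self, feature):
        """Return True when tracking is judged lost and re-registration is due."""
        sim = cosine_similarity(self.reference, feature)
        # Count consecutive frames whose similarity falls below the cut-off.
        self._low_count = self._low_count + 1 if sim < self.threshold else 0
        if self._low_count >= self.patience:
            self._low_count = 0  # reset so monitoring restarts after re-registration
            return True
        return False
```

In a tracking loop, `update` would be called once per frame with the feature of the currently tracked region; a `True` result would hand control back to the segmentation stage to re-register the object. Requiring several consecutive low-similarity frames is one simple way to avoid re-registering on a single noisy frame.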

Paper Structure

This paper contains 57 sections, 19 equations, 9 figures, 7 tables, 1 algorithm.

Figures (9)

  • Figure 1: Pose-estimation models lack robustness. For example, FoundationPose (Wen et al., 2024) fails to detect a mug under camera pose variation, highlighting its sensitivity to viewpoint shifts.
  • Figure 2: Overview of STORM, which is composed of two subsystems: the Segmenting Object Module (SOM) and the Tracking Object Module (TOM). SOM leverages reference images to generate a 3D model and, using their semantic and spatial information, integrates both intra-image and inter-image attention modules to capture spatial cues of the query frame through local and global attention blocks, producing a segmented mask. TOM classifies the output of the tracking module and utilizes the memory of SOM to perform re-registration when tracking fails, thereby ensuring robust and continuous tracking performance.
  • Figure 3: STORM (SOM+TOM) achieves robust pose estimation for occluded objects in complex scenes. We compare pose-estimation quality on the LMO and YCB-V datasets, which comprise complex scenes with multiple, possibly occluded objects. For the baselines, the segmentation mask comes from CNOS or from the ground-truth annotation, and FoundationPose produces the pose estimate. The results indicate that our method produces pose estimates that are quantitatively and qualitatively close to the ground-truth annotations in these scenarios. Detected objects are highlighted in green and pink.
  • Figure 4: Comparison of 3D models reconstructed from reference images and ground-truth 3D models. The Aligned SAM3D models almost perfectly recover the underlying object structure while producing smoother and more regular contours along object boundaries.
  • Figure 5: STORM automatically recovers from tracking failures. A demonstration of the Tracking Object Module (TOM) successfully re-tracking a lost object. In contrast, FoundationPose fails to recover once tracking is lost.
  • ...and 4 more figures