STORM: Segment, Track, and Object Re-Localization from a Single Image
Yu Deng, Teng Cao, Hikaru Shindo, Jiahong Xue, Quentin Delfosse, Kristian Kersting
TL;DR
STORM tackles the deployment bottleneck of 6D pose estimation and tracking by delivering an annotation-free pipeline that uses reference images to generate masks and object-centric 3D models. It integrates a segmentation module (SOM), built on Hierarchical Spatial Fusion Attention (HSFA) and SAM3D-based 3D reconstruction, with a tracking module (TOM) that detects failures via a memory-based verifier and re-initializes as needed. The approach achieves state-of-the-art annotation-free performance on challenging benchmarks, runs at real-time speeds, and is robust to occlusion and fast motion. Key innovations include language-guided semantic prompts, HSFA for cross-domain feature fusion, and a lightweight tracking-loss classifier that enables reliable automatic re-registration.
Abstract
Accurate 6D pose estimation and tracking are fundamental capabilities for physical AI systems such as robots. However, existing approaches typically require a pre-defined 3D model of the target and rely on a manually annotated segmentation mask in the first frame. This setup is labor-intensive, and performance degrades under occlusion or rapid movement. To address these limitations, we propose STORM (Segment, Track, and Object Re-localization from a single iMage), an open-source, robust, real-time 6D pose estimation system that requires no manual annotation. STORM employs a novel three-stage pipeline that combines vision-language understanding with feature matching: contextual object descriptions guide localization, self-cross-attention mechanisms identify candidate regions, and the final stage produces precise masks and 3D models for accurate pose estimation. A second key innovation is an automatic re-registration mechanism, which detects tracking failures through feature similarity monitoring and recovers from severe occlusions or rapid motion. STORM achieves state-of-the-art accuracy on challenging industrial datasets featuring multi-object occlusions, high-speed motion, and varying illumination, while operating at real-time speeds without additional training. This annotation-free approach significantly reduces deployment overhead, providing a practical solution for modern applications such as flexible manufacturing and intelligent quality control.
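The idea of detecting tracking failures through feature similarity monitoring can be illustrated with a minimal sketch. It is not STORM's implementation: the class name `TrackingMonitor`, the cosine-similarity measure, the `threshold` and `patience` values, and the toy three-dimensional feature vectors are all assumptions chosen for illustration. The sketch compares each frame's feature against a stored reference feature and signals that re-registration should run once the similarity stays low for several consecutive frames.

```python
import math
from collections import deque


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


class TrackingMonitor:
    """Hypothetical monitor: flags tracking loss after `patience` low-similarity frames."""

    def __init__(self, reference_feature, threshold=0.7, patience=3):
        # threshold and patience are illustrative values, not taken from the paper.
        self.reference = list(reference_feature)
        self.threshold = threshold
        self.recent = deque(maxlen=patience)

    def update(self, frame_feature):
        """Return True when tracking is judged lost and re-registration should run."""
        similarity = cosine_similarity(self.reference, frame_feature)
        self.recent.append(similarity < self.threshold)
        lost = len(self.recent) == self.recent.maxlen and all(self.recent)
        if lost:
            # Start counting afresh once re-registration has been requested.
            self.recent.clear()
        return lost


# Toy sequence: the object is visible, then occluded for two frames, then visible again.
monitor = TrackingMonitor([1.0, 0.0, 0.0], threshold=0.7, patience=2)
frames = [
    [0.9, 0.1, 0.0],
    [0.1, 1.0, 0.0],
    [0.0, 1.0, 0.2],
    [1.0, 0.0, 0.0],
]
flags = [monitor.update(feature) for feature in frames]
print(flags)
```

Requiring several consecutive low-similarity frames, rather than reacting to a single one, is a common way to avoid triggering re-registration on a momentary appearance change. Whether STORM uses such a rule is not stated in the abstract.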
