OnlineAnySeg: Online Zero-Shot 3D Segmentation by Visual Foundation Model Guided 2D Mask Merging

Yijie Tang, Jiazhao Zhang, Yuqing Lan, Yulan Guo, Dezun Dong, Chenyang Zhu, Kai Xu

TL;DR

This work tackles online zero-shot 3D instance segmentation in incrementally reconstructed scenes by leveraging visual foundation models to provide 2D priors. It introduces OnlineAnySeg, which uses a hashed voxel volume (Vol) to lift 2D masks into a unified 3D representation and maintains a mask bank (G) with an append-only mapping to enable real-time, spatially consistent mask associations. Mask merging is driven by an online similarity framework that fuses mask overlap, semantic, and geometric features, along with third-view consensus. Voxel hashing reduces the cost of spatial overlap queries from $O(n^2)$ to $O(n)$, and the method reports state-of-the-art online performance on the ScanNet200 and SceneNN benchmarks at about 15 FPS. The approach remains robust to incomplete data and outperforms several online baselines while approaching offline methods, highlighting the practical value of explicit 3D spatial associations for open-vocabulary 3D segmentation in embodied AI contexts.

Abstract

Online zero-shot 3D instance segmentation of a progressively reconstructed scene is both a critical and challenging task for embodied applications. With the success of visual foundation models (VFMs) in the image domain, leveraging 2D priors to address 3D online segmentation has become a prominent research focus. Since segmentation results provided by 2D priors often require spatial consistency to be lifted into final 3D segmentation, an efficient method for identifying spatial overlap among 2D masks is essential, yet existing methods rarely achieve this in real time, which largely confines such overlap analysis to offline approaches. To address this, we propose an efficient method that lifts 2D masks generated by VFMs into a unified 3D instance using a hashing technique. By employing voxel hashing for efficient 3D scene querying, our approach reduces the time complexity of costly spatial overlap queries from $O(n^2)$ to $O(n)$. Accurate spatial associations further enable 3D merging of 2D masks through simple similarity-based filtering in a zero-shot manner, making our approach more robust to incomplete and noisy data. Evaluated on the ScanNet and SceneNN benchmarks, our approach achieves state-of-the-art performance in online, zero-shot 3D instance segmentation with leading efficiency.
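To make the voxel-hashing idea concrete, here is a minimal sketch of how a hash from voxel coordinates to mask IDs can return spatial overlaps without comparing a new mask against every earlier mask directly. It is not the authors' implementation: the names `voxel_key` and `VoxelMaskIndex`, the default voxel size, and the use of a Python dictionary in place of a GPU voxel hash are all assumptions made for illustration. The abstract does not define $n$, so the sketch only shows the general mechanism.

```python
import math
from collections import defaultdict


def voxel_key(point, voxel_size):
    """Quantize a 3D point to integer voxel coordinates (used as the hash key)."""
    return tuple(math.floor(c / voxel_size) for c in point)


class VoxelMaskIndex:
    """Hypothetical index: voxel -> IDs of the masks whose points fell into it."""

    def __init__(self, voxel_size=0.05):
        self.voxel_size = voxel_size
        self.voxel_to_masks = defaultdict(set)  # only ever grows
        self.mask_voxel_count = {}              # mask ID -> number of voxels covered

    def insert(self, mask_id, points):
        """Register a back-projected mask and return its overlap with earlier masks.

        The work is proportional to the voxels the new mask touches (times the
        masks already stored in those voxels), not to the total number of masks.
        """
        keys = {voxel_key(p, self.voxel_size) for p in points}
        overlap = defaultdict(int)  # earlier mask ID -> number of shared voxels
        for key in keys:
            for other in self.voxel_to_masks[key]:
                overlap[other] += 1
            self.voxel_to_masks[key].add(mask_id)
        self.mask_voxel_count[mask_id] = len(keys)
        return dict(overlap)


index = VoxelMaskIndex(voxel_size=1.0)
first = index.insert("a", [(0.2, 0.2, 0.2), (1.5, 0.2, 0.2), (2.5, 0.1, 0.1)])
second = index.insert("b", [(1.1, 0.9, 0.3), (2.2, 0.5, 0.5), (5.0, 5.0, 5.0)])
print(first, second)
```

In this toy run, mask `"b"` shares two voxels with mask `"a"`, and that count is found by looking up only the three voxels that `"b"` occupies. A pairwise approach would instead have to test `"b"` against every previously stored mask.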

Paper Structure

This paper contains 29 sections, 5 equations, 6 figures, 3 tables.

Figures (6)

  • Figure 1: We propose an online zero-shot 3D segmentation method that establishes precise spatial associations between VFM-generated 2D masks from sequentially captured frames. We demonstrate an efficient merging process for masks detected from various viewpoints, enabling robust and consistent 3D instance segmentation in real time.
  • Figure 2: Overall pipeline. (a) A posed RGB-D stream is input to our method sequentially. (b) A series of 2D masks are generated by VFM from the input color image and back-projected into 3D space, establishing associations with the VoxelHashing scene representation. Meanwhile, semantic and geometric features of the masks are extracted from pre-trained feature extractors and, together with mask overlap associations, serve as the core criteria for the Mask Merging process. (c) The final prediction of 3D instances is then output.
  • Figure 3: The dynamically synchronized mapping table. The mapping table is updated during the Mask Merging stage and facilitates efficient query in the Query stage, allowing the hash table to remain append-only.
  • Figure 4: Intermediate instance segmentation results, displayed on each method’s reconstructed mesh or point cloud.
  • Figure 5: Open-vocabulary instance retrieval with varied query texts during the scanning process.
  • ...and 1 more figure
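The mapping table described in the Figure 3 caption can be sketched as a small indirection layer. Voxels keep the original mask IDs they were given, and merging only rewrites a separate table that is consulted at query time. This is a simplified illustration under assumptions, not the paper's data structure: the class `MaskMappingTable` and its methods `register`, `merge`, and `resolve` are hypothetical names.

```python
class MaskMappingTable:
    """Hypothetical table: original mask ID -> ID of the instance it now belongs to."""

    def __init__(self):
        self.current = {}

    def register(self, mask_id):
        # A new mask initially represents itself.
        self.current.setdefault(mask_id, mask_id)

    def merge(self, keep, absorb):
        """Redirect every mask that resolved to `absorb`'s instance to `keep`'s instance."""
        keep_id = self.current[keep]
        absorb_id = self.current[absorb]
        for mask_id, target in list(self.current.items()):
            if target == absorb_id:
                self.current[mask_id] = keep_id

    def resolve(self, mask_id):
        return self.current[mask_id]


# Voxel storage is append-only: entries are added but never rewritten on merge.
voxel_to_masks = {(0, 0, 0): [1, 2], (1, 0, 0): [3]}

table = MaskMappingTable()
for mask_id in (1, 2, 3):
    table.register(mask_id)

table.merge(keep=1, absorb=2)  # masks 1 and 2 are judged to be the same instance

# Query stage: translate the stored IDs through the mapping table.
instances_in_voxel = {table.resolve(m) for m in voxel_to_masks[(0, 0, 0)]}
print(instances_in_voxel)
```

The design choice this illustrates is that a merge touches only the small mapping table, so the much larger voxel storage never needs to be rewritten.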