OnlineAnySeg: Online Zero-Shot 3D Segmentation by Visual Foundation Model Guided 2D Mask Merging
Yijie Tang, Jiazhao Zhang, Yuqing Lan, Yulan Guo, Dezun Dong, Chenyang Zhu, Kai Xu
TL;DR
This work tackles online zero-shot 3D instance segmentation in incrementally reconstructed scenes by leveraging visual foundation models to provide 2D priors. It introduces OnlineAnySeg, which uses a hashed voxel volume (Vol) to lift 2D masks into a unified 3D representation and maintains a mask bank (G) with an append-only mapping to enable real-time, spatially consistent mask associations. Mask merging is driven by an online similarity framework that fuses mask overlap, semantic, and geometric features, along with third-view consensus. Voxel hashing reduces the time complexity of spatial overlap queries from $O(n^2)$ to $O(n)$, and the method achieves state-of-the-art online performance on the ScanNet200 and SceneNN benchmarks at about 15 FPS. The approach remains robust to incomplete data and outperforms several online baselines while approaching offline methods, highlighting the practical value of explicit 3D spatial associations for open-vocabulary 3D segmentation in embodied AI contexts.
Abstract
Online zero-shot 3D instance segmentation of a progressively reconstructed scene is both a critical and challenging task for embodied applications. With the success of visual foundation models (VFMs) in the image domain, leveraging 2D priors to address 3D online segmentation has become a prominent research focus. Since segmentation results provided by 2D priors often require spatial consistency to be lifted into final 3D segmentation, an efficient method for identifying spatial overlap among 2D masks is essential. Existing methods, however, rarely achieve this in real time, which largely confines them to offline use. To address this, we propose an efficient method that lifts 2D masks generated by VFMs into a unified 3D instance using a hashing technique. By employing voxel hashing for efficient 3D scene querying, our approach reduces the time complexity of costly spatial overlap queries from $O(n^2)$ to $O(n)$. Accurate spatial associations further enable 3D merging of 2D masks through simple similarity-based filtering in a zero-shot manner, making our approach more robust to incomplete and noisy data. Evaluated on the ScanNet and SceneNN benchmarks, our approach achieves state-of-the-art performance in online, zero-shot 3D instance segmentation with leading efficiency.
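To make the voxel-hashing idea concrete, the following is a minimal Python sketch of how a hash map from voxels to mask IDs can answer overlap queries without comparing a mask against every other stored mask. It is an illustration under stated assumptions, not the authors' implementation: the class `VoxelMaskIndex`, the helper `voxel_keys`, the 5 cm voxel size, and the use of a shared-voxel count as the overlap measure are all hypothetical choices made for this example.

```python
from collections import defaultdict

import numpy as np

VOXEL_SIZE = 0.05  # metres; hypothetical value chosen for this sketch


def voxel_keys(points, voxel_size=VOXEL_SIZE):
    """Quantise an (N, 3) array of 3D points into a set of integer voxel keys."""
    idx = np.floor(np.asarray(points, dtype=float) / voxel_size).astype(np.int64)
    return {tuple(v) for v in idx.tolist()}


class VoxelMaskIndex:
    """Hypothetical inverted index: voxel key -> IDs of masks covering that voxel."""

    def __init__(self, voxel_size=VOXEL_SIZE):
        self.voxel_size = voxel_size
        self.voxel_to_masks = defaultdict(set)
        self.mask_to_voxels = {}

    def add_mask(self, mask_id, points):
        """Register a mask given the 3D points its pixels back-project to."""
        keys = voxel_keys(points, self.voxel_size)
        self.mask_to_voxels[mask_id] = keys
        for key in keys:
            self.voxel_to_masks[key].add(mask_id)

    def overlaps(self, mask_id):
        """Return {other_id: shared voxel count} in one pass over this mask's voxels."""
        counts = defaultdict(int)
        for key in self.mask_to_voxels[mask_id]:
            for other in self.voxel_to_masks[key]:
                if other != mask_id:
                    counts[other] += 1
        return dict(counts)

    def overlap_ratio(self, a, b):
        """Shared voxels divided by the size of the smaller mask (assumed measure)."""
        shared = self.overlaps(a).get(b, 0)
        smaller = min(len(self.mask_to_voxels[a]), len(self.mask_to_voxels[b]))
        return shared / smaller if smaller else 0.0
```

In this sketch, a query touches only the voxels of the queried mask and the masks already stored in those voxels, so masks elsewhere in the scene are never visited. A brute-force alternative would intersect the queried mask with every stored mask in turn. How the paper's actual volume and mask bank are organised is not specified in the abstract, so the structure above should be read only as one plausible realisation of the hashing idea.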
