Easy3D: A Simple Yet Effective Method for 3D Interactive Segmentation

Andrea Simonelli, Norman Müller, Peter Kontschieder

TL;DR

Easy3D tackles 3D interactive instance segmentation with a simple yet powerful architecture that blends a voxel-based sparse encoder with a transformer-based decoder and implicit click fusion. The method incorporates learned negative embeddings to better distinguish background, enabling strong generalization to unseen objects and geometric distributions, including Gaussian Splatting representations. Across ScanNet, ScanNet++, S3DIS, KITTI-360, and GS-ScanNet40, Easy3D achieves state-of-the-art performance with few user clicks, demonstrating both robustness and efficiency. The work provides comprehensive ablations and qualitative demonstrations, highlighting practical applicability for VR and robotic scenarios where rapid, accurate 3D segmentation is essential.

Abstract

The increasing availability of digital 3D environments, whether through image-based 3D reconstruction, generation, or scans obtained by robots, is driving innovation across various applications. These come with a significant demand for 3D interaction, such as 3D Interactive Segmentation, which is useful for tasks like object selection and manipulation. Additionally, there is a persistent need for solutions that are efficient, precise, and perform well across diverse settings, particularly in unseen environments and with unfamiliar objects. In this work, we introduce a 3D interactive segmentation method that consistently surpasses previous state-of-the-art techniques on both in-domain and out-of-domain datasets. Our simple approach integrates a voxel-based sparse encoder with a lightweight transformer-based decoder that implements implicit click fusion, achieving superior performance and maximizing efficiency. Our method demonstrates substantial improvements on benchmark datasets, including ScanNet, ScanNet++, S3DIS, and KITTI-360, and also on unseen geometric distributions such as those obtained with Gaussian Splatting. The project web page is available at https://simonelli-andrea.github.io/easy3d.

Paper Structure

This paper contains 32 sections, 11 figures, 4 tables.

Figures (11)

  • Figure 1: Example of a typical use case of Interactive 3D Instance Segmentation (top) and overview of our method's components (bottom). A user interacts with a 3D scene (top left) and defines a 3D click (blue circle) to select an object (table). The scene and clicks are fed to Easy3D, which generates the corresponding instance segmentation mask (top right). After visually inspecting the result (red mask), the user can provide feedback and refine the output with additional 3D clicks. To obtain the mask, Easy3D first maps the scene and user click(s) to a voxelized 3D representation and then extracts scene features using a sparse U-Net. The processed scene and clicks are then fed to a two-directional transformer decoder, which updates them by exchanging information through attention. Finally, an implicit click fusion operation predicts the output instance segmentation mask.
  • Figure 2: Architecture of our method, for which we provide details in Sec. \ref{sec:method}. The set of user clicks $C$ (top left) is encoded into the clicks embedding $C_E$. The input scene points $S_P$ (bottom left) are mapped into a voxelized scene $S_V$ and encoded into a scene embedding $S_E$. The clicks and scene embedding, with the additional learned embeddings $L_E$ (e.g. output embeddings), are then fed to the decoder which uses attention operations to update them. The updated embeddings are then fused using a click fusion strategy to obtain the segmentation mask on voxels $M_V$, which is finally mapped back to the original scene points to obtain the output points mask $M_P$.
  • Figure 3: Depiction of explicit click fusion as in AGILE3D \cite{yue2023agile3d} (top) and implicit click fusion as used in our method (bottom).
  • Figure 4: Given a target object (green mask, left) and a first click (sphere at IoU@1) we compare mask predictions (red) and IoU@i of our method and AGILE3D \cite{yue2023agile3d} with $\mathsf{N_C}\leq3$. For 2nd and 3rd clicks we apply the same simulated interaction based on each method's errors.
  • Figure 5: Visualizations of a VR application which, using a consumer-grade headset, lets a user easily segment objects using Easy3D, manipulate them (top row) and make them explode (middle and bottom rows). In this application, Easy3D fits seamlessly into a Gaussian Splatting \cite{kerbl20233d} rendering pipeline enabling interactive 3D segmentation in real-time. The scene is taken from \cite{xu2023vr}.
  • ...and 6 more figures
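
The data flow around the network in the Figure 2 caption (points $S_P$ to voxels $S_V$, then voxel mask $M_V$ back to point mask $M_P$) and the IoU score reported per click in Figure 4 can be sketched as follows. This is a minimal, dependency-free illustration and not the authors' implementation: the function names, the `voxel_size` parameter, and the convention of returning 1.0 for an empty union are assumptions made here, and the sparse U-Net encoder, transformer decoder, and click fusion are not reproduced.

```python
import math


def voxelize(points, voxel_size):
    """Map scene points S_P to voxels S_V (hypothetical helper).

    Returns the list of occupied voxel coordinates (in first-seen order)
    and, for every point, the index of the voxel that contains it.
    """
    voxel_index = {}      # voxel coordinate -> position in `voxels`
    voxels = []           # occupied voxel coordinates, S_V
    point_to_voxel = []   # per-point index into `voxels`
    for x, y, z in points:
        key = (
            math.floor(x / voxel_size),
            math.floor(y / voxel_size),
            math.floor(z / voxel_size),
        )
        if key not in voxel_index:
            voxel_index[key] = len(voxels)
            voxels.append(key)
        point_to_voxel.append(voxel_index[key])
    return voxels, point_to_voxel


def voxel_mask_to_point_mask(voxel_mask, point_to_voxel):
    """Map a per-voxel mask M_V back to a per-point mask M_P."""
    return [voxel_mask[v] for v in point_to_voxel]


def iou(pred, target):
    """Intersection over union of two boolean per-point masks.

    Returning 1.0 when both masks are empty is an assumed convention.
    """
    inter = sum(1 for p, t in zip(pred, target) if p and t)
    union = sum(1 for p, t in zip(pred, target) if p or t)
    return inter / union if union else 1.0


if __name__ == "__main__":
    # Four points; the first two fall into the same 0.5-unit voxel.
    scene_points = [(0.1, 0.1, 0.1), (0.2, 0.3, 0.4),
                    (1.2, 0.1, 0.1), (2.6, 2.7, 2.9)]
    voxels, point_to_voxel = voxelize(scene_points, voxel_size=0.5)

    # Stand-in for the network output: only the first voxel is selected.
    voxel_mask = [True, False, False]
    point_mask = voxel_mask_to_point_mask(voxel_mask, point_to_voxel)

    ground_truth = [True, True, True, False]
    print(point_mask, iou(point_mask, ground_truth))
```

Keeping the point-to-voxel index from the voxelization step makes the final mapping from $M_V$ to $M_P$ a simple lookup, so every original point receives the label of the voxel it fell into. In practice an array or sparse-tensor library would replace the Python loops.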