AGILE3D: Attention Guided Interactive Multi-object 3D Segmentation

Yuanwen Yue, Sabarinath Mahadevan, Jonas Schult, Francis Engelmann, Bastian Leibe, Konrad Schindler, Theodora Kontogianni

TL;DR

AGILE3D tackles interactive 3D instance segmentation by enabling simultaneous segmentation of multiple objects in a point cloud. The method encodes user feedback as spatial-temporal click queries and propagates it through a click attention module that allows click-to-scene, click-to-click, and scene-to-click interactions, with a lightweight decoder updating masks per iteration. Trained with an iterative multi-object simulation strategy and evaluated across ScanNetV2, S3DIS, and KITTI-360, AGILE3D achieves state-of-the-art performance for both single-object and multi-object tasks and shows strong generalization to unseen domains. Real-user studies corroborate practicality in real annotation tasks, including outdoor LiDAR data, and ablations underscore the critical role of the spatial-temporal encoding and attention design. Overall, AGILE3D enables efficient, accurate, and scalable interactive segmentation for complex 3D scenes.

Abstract

During interactive segmentation, a model and a user work together to delineate objects of interest in a 3D point cloud. In an iterative process, the model assigns each data point to an object (or the background), while the user corrects errors in the resulting segmentation and feeds them back into the model. The current best practice formulates the problem as binary classification and segments objects one at a time. The model expects the user to provide positive clicks to indicate regions wrongly assigned to the background and negative clicks on regions wrongly assigned to the object. Sequentially visiting objects is wasteful since it disregards synergies between objects: a positive click for a given object can, by definition, serve as a negative click for nearby objects. Moreover, a direct competition between adjacent objects can speed up the identification of their common boundary. We introduce AGILE3D, an efficient, attention-based model that (1) supports simultaneous segmentation of multiple 3D objects, (2) yields more accurate segmentation masks with fewer user clicks, and (3) offers faster inference. Our core idea is to encode user clicks as spatial-temporal queries and enable explicit interactions between click queries as well as between them and the 3D scene through a click attention module. Every time new clicks are added, we only need to run a lightweight decoder that produces updated segmentation masks. In experiments with four different 3D point cloud datasets, AGILE3D sets a new state-of-the-art. Moreover, we also verify its practicality in real-world setups with real user studies.
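The synergy between objects described above can be illustrated with a small sketch. It is not the authors' implementation: the `Click` record, the `binary_view` helper, and the convention that region 0 is the background are hypothetical names chosen for this example. Each click stores where it was placed (spatial), when it was placed (temporal), and which region it refers to. The helper shows how one shared click set yields positives and negatives for any target object, under the simplifying assumption that every click on another region counts as a negative, not only clicks on nearby objects.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Click:
    position: Tuple[float, float, float]  # 3D coordinates of the clicked point
    region: int  # 0 = background, k >= 1 = object k (hypothetical convention)
    step: int    # order in which the user placed the click


def binary_view(clicks: List[Click], target: int) -> Tuple[List[Click], List[Click]]:
    """Split a shared click set into positives and negatives for one object.

    Simplifying assumption: every click on another region is a negative
    for `target`, not only clicks on nearby objects.
    """
    positives = [c for c in clicks if c.region == target]
    negatives = [c for c in clicks if c.region != target]
    return positives, negatives


clicks = [
    Click((0.1, 0.2, 0.0), region=1, step=0),  # placed on object 1
    Click((1.5, 0.3, 0.0), region=2, step=1),  # placed on object 2
    Click((0.8, 2.0, 0.0), region=0, step=2),  # placed on the background
]

# The same three clicks serve object 2 without any extra user input:
# one positive (its own click) and two negatives (object 1, background).
pos, neg = binary_view(clicks, target=2)
```

In a one-object-at-a-time formulation, the user would have to supply the negatives for object 2 separately; here they come for free from clicks made for other regions.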

Paper Structure

This paper contains 30 sections, 3 equations, 17 figures, and 13 tables.

Figures (17)

  • Figure 1: Architecture comparison. Left: InterObject3D [kontogianni2022interactive]. Right: our AGILE3D. Given the same set of 10 user clicks, AGILE3D can effectively segment three objects while InterObject3D can only segment one. InterObject3D takes 0.5s for one object while we need only 0.25s for all three, thanks to running a lightweight decoder per iteration and not a full forward pass through the entire network. Unlike InterObject3D, our backbone learns features for all objects.
  • Figure 2: Model of AGILE3D. Given a 3D scene and a user click sequence, (a) the feature backbone extracts per-point features and (b) the click-as-query module converts user clicks to high-dimensional query vectors. (c) The click attention module refines the click queries and point features through multiple attention mechanisms. (d) The query fusion module first fuses the per-click mask logits into region-specific mask logits and then produces a final mask through a softmax. With $\rightarrow$ we denote the user click information and with $\rightarrow$ the scene information. Colors of clicks, click queries and segmentation masks are consistent for the same object.
  • Figure 3: Multi-object iterative training.
  • Figure 4: Open-world segmentation from ScanNet20. AGILE3D can segment new objects like statue and phone.
  • Figure 5: Qualitative results on interactive single-object segmentation.
  • ...and 12 more figures
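
The query fusion step in the Figure 2 caption (per-click mask logits fused into region-specific logits, followed by a softmax) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function names `fuse_click_logits` and `softmax_labels` are hypothetical, a per-point maximum is assumed as the fusion rule for clicks that share a region, and every region is assumed to have at least one click.

```python
import math
from typing import List, Tuple


def fuse_click_logits(click_logits: List[List[float]],
                      click_regions: List[int],
                      num_regions: int) -> List[List[float]]:
    """Fuse per-click logits (clicks x points) into per-region logits.

    Assumed fusion rule: per-point maximum over all clicks of a region.
    """
    num_points = len(click_logits[0])
    region_logits = [[-math.inf] * num_points for _ in range(num_regions)]
    for logits, region in zip(click_logits, click_regions):
        row = region_logits[region]
        for p, value in enumerate(logits):
            if value > row[p]:
                row[p] = value
    return region_logits


def softmax_labels(region_logits: List[List[float]]) -> Tuple[List[int], List[List[float]]]:
    """Apply a softmax over regions at each point and pick the best region.

    Assumes every region has at least one click, so no point has -inf
    logits for all regions.
    """
    labels, probs = [], []
    for p in range(len(region_logits[0])):
        column = [row[p] for row in region_logits]
        peak = max(column)  # subtract the maximum for numerical stability
        exps = [math.exp(v - peak) for v in column]
        total = sum(exps)
        point_probs = [e / total for e in exps]
        probs.append(point_probs)
        labels.append(max(range(len(column)), key=point_probs.__getitem__))
    return labels, probs


# Three clicks over three points: one background click (region 0)
# and two clicks on the same object (region 1).
click_logits = [
    [2.0, -1.0, 0.0],   # click 0, region 0
    [0.5, 3.0, -2.0],   # click 1, region 1
    [1.0, 0.0, 1.5],    # click 2, region 1
]
region_logits = fuse_click_logits(click_logits, [0, 1, 1], num_regions=2)
labels, probs = softmax_labels(region_logits)
# labels == [0, 1, 1]: the first point goes to the background, the rest to the object.
```

Because all regions compete in one softmax, each point receives exactly one label, which is the direct competition between adjacent objects that the abstract refers to.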