Active Coarse-to-Fine Segmentation of Moveable Parts from Real Images

Ruiqi Wang, Akshay Gadi Patil, Fenggen Yu, Hao Zhang

TL;DR

This work tackles the challenge of segmenting moveable parts in real indoor scenes with minimal manual labeling. It introduces a coarse-to-fine active-learning framework built around a pose-aware masked-attention Transformer, enabling high-accuracy 2D segmentation and semantic labeling of moveable parts while drastically reducing human effort. The approach yields over 90% segmentation accuracy on a 2,000-image real-world test set with significantly less labeling required (11.45% of images), and it provides a large, diverse 2,550-image real dataset of articulated objects. The contributions include the first AL-based framework for moveable-part segmentation, a two-stage network that leverages object and pose cues, and a dataset that advances real-world understanding of articulated objects for downstream tasks such as 3D reconstruction and manipulation.

Abstract

We introduce the first active learning (AL) model for high-accuracy instance segmentation of moveable parts from RGB images of real indoor scenes. Specifically, our goal is to obtain segmentation results that are fully validated by humans while minimizing manual effort. To this end, we employ a transformer that utilizes a masked-attention mechanism to supervise the active segmentation. To tailor the network to moveable parts, we introduce a coarse-to-fine AL approach which first uses an object-aware masked attention and then a pose-aware one, leveraging the hierarchical nature of the problem and a correlation between moveable parts and object poses and interaction directions. When applying our AL model to 2,000 real images, we obtain fully validated moveable part segmentations with semantic labels while needing to manually annotate only 11.45% of the images. This translates to a significant (60%) time saving over the manual effort required by the best non-AL model to attain the same segmentation accuracy. Finally, we contribute a dataset of 2,550 real images with annotated moveable parts, demonstrating its superior quality and diversity over the best alternatives.
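To make the labeling figures in the abstract concrete, the short sketch below converts the reported 11.45% into image counts for the 2,000-image set. The function name `annotation_budget` and its parameters are hypothetical and are used only for this illustration; the sketch assumes that every image not manually annotated is one whose predicted segmentation was accepted during human validation.

```python
def annotation_budget(total_images: int, annotated_percent: float) -> tuple[int, int]:
    """Split an image set into manually annotated and validated-only counts.

    Hypothetical helper for illustration; it is not part of the paper's code.
    """
    # round() guards against floating-point error in the percentage product.
    annotated = round(total_images * annotated_percent / 100)
    # Assumption: the remaining images only needed their predictions validated.
    validated_only = total_images - annotated
    return annotated, validated_only


# Figures taken from the abstract: 2,000 images, 11.45% manually annotated.
annotated, validated_only = annotation_budget(2000, 11.45)
print(f"manually annotated: {annotated}, validated from predictions: {validated_only}")
```

Under these assumptions, 229 of the 2,000 images require manual annotation and the remaining 1,771 are accepted from the model's predictions.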

Paper Structure (24 sections, 1 equation, 6 figures, 5 tables)

Figures (6)

  • Figure 1: Our instance segmentation of moveable parts, with semantic labels, on real-world photos. Comparison is made with OPDFormer-C (OPD = openable part detection), the current state of the art, where small red $\times$s indicate erroneous or missed labels. Our method generalizes to non-openable parts, e.g., on lamps and bottles (top right). As an application of accurate moveable part segmentation, we can manipulate 3D reconstructions of articulated objects (bottom right).
  • Figure 2: Overview of our pose-aware masked attention network for moveable part segmentation of articulated objects in real scene images. Utilizing a two-stage framework, we first derive a coarse segmentation by predicting the object mask, its 6 DoF pose, and the interaction direction, subsequently isolating the interaction surface of the objects. In the fine segmentation stage, we combine the object mask and interaction surface to form a refined mask, enabling the extraction of fine-grained instance segmentation of moveable parts.
  • Figure 3: Our coarse-to-fine Active Learning (AL) training pipeline. The coarse AL stage is applied to interaction directions: high-quality predictions are retained, while the rest are manually rectified. These rectified predictions form a constructive prior for refined mask prediction. Subsequently, the fine AL stage utilizes these refined masks, employing an iterative training method with continuous human intervention for accurate part mask annotation.
  • Figure 4: Qualitative results on the OPDReal and OPDMulti test sets. $\text{Ours}_{w/o AL}$ outperforms others on noisy GT and multiple objects. See supplementary materials for more results.
  • Figure 5: Part-level reconstruction and manipulation of the bottle and dishwasher
  • ...and 1 more figures