Locate n' Rotate: Two-stage Openable Part Detection with Foundation Model Priors

Siqi Li, Xiaoxue Chen, Haoyu Cheng, Guyue Zhou, Hao Zhao, Guanzhong Tian

TL;DR

This work addresses Openable Part Detection (OPD) from single-view RGB images, which requires both object-category understanding and motion parameter estimation. It introduces MOPD, a two-stage Transformer framework that injects perceptual grouping priors and geometric priors through separate encoders and cross-attention fusion, enabling better generalization to unseen objects. Training employs an Optimal Transport Assignment (OTA) with motion-aware costs and a loss $L$ that combines segmentation and motion terms: $L = L_{seg} + L_{mot}$, where $L_{seg} = \lambda_{ce} L_{ce} + \lambda_{dice} L_{dice} + \lambda_{cls} L_{cls} + \lambda_{c} L_c$ and $L_{mot} = \lambda_{a} L_a + \lambda_{o} L_o + \lambda_{pose} L_{pose}$. Empirical results on OPDMulti show that MOPD improves part-detection mAP by up to 4.9% and motion-parameter accuracy by about 1.2% over strong baselines, with ablations confirming the contributions of the perceptual and geometric priors and of OTA; the approach is efficient enough for real-time robotics applications.
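
The weighted sum in the loss above can be illustrated with a minimal Python sketch. The weight values below are hypothetical placeholders (the summary does not state them), and the individual loss terms are assumed to have been computed elsewhere and passed in as plain floats keyed by their subscripts.

```python
from dataclasses import dataclass


@dataclass
class LossWeights:
    # Hypothetical defaults: the summary does not give the lambda values.
    ce: float = 1.0
    dice: float = 1.0
    cls: float = 1.0
    c: float = 1.0
    a: float = 1.0
    o: float = 1.0
    pose: float = 1.0


def total_loss(terms: dict, w: LossWeights) -> float:
    """Combine precomputed loss terms as L = L_seg + L_mot."""
    # Segmentation branch: weighted sum of the ce, dice, cls and c terms.
    l_seg = (w.ce * terms["ce"] + w.dice * terms["dice"]
             + w.cls * terms["cls"] + w.c * terms["c"])
    # Motion branch: weighted sum of the a, o and pose terms.
    l_mot = w.a * terms["a"] + w.o * terms["o"] + w.pose * terms["pose"]
    return l_seg + l_mot


# Illustrative values only, not results from the paper.
example_terms = {k: 1.0 for k in ("ce", "dice", "cls", "c", "a", "o", "pose")}
example_total = total_loss(example_terms, LossWeights())
```

Keeping the weights in one small structure makes it easy to rebalance the segmentation and motion branches without touching the code that computes each term.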

Abstract

Detecting the openable parts of articulated objects is crucial for downstream applications in intelligent robotics, such as pulling a drawer. This task poses a multitasking challenge because it requires understanding both object categories and motion. Most existing methods are either category-specific or trained on specific datasets, and so generalize poorly to unseen environments and objects. In this paper, we propose a Transformer-based Openable Part Detection (OPD) framework named Multi-feature Openable Part Detection (MOPD) that incorporates perceptual grouping and geometric priors and outperforms previous methods. In the first stage of the framework, we introduce a perceptual grouping feature model that provides perceptual grouping feature priors for openable part detection, enhancing detection results through a cross-attention mechanism. In the second stage, a geometric understanding feature model offers geometric feature priors for predicting motion parameters. Compared to existing methods, our proposed approach shows better performance in both detection and motion parameter prediction. Code and models are publicly available at https://github.com/lisiqi-zju/MOPD
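
To make the cross-attention fusion concrete, here is a generic single-head sketch in NumPy. It is not the authors' implementation: the feature shapes, the random projection matrices, and the residual connection are all assumptions chosen for illustration. Queries come from the task (OPD) features, while keys and values come from the prior features.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(queries, prior_feats, w_q, w_k, w_v):
    """Single-head cross-attention: queries attend to prior features."""
    q = queries @ w_q          # (n_q, d)
    k = prior_feats @ w_k      # (n_k, d)
    v = prior_feats @ w_v      # (n_k, d)
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = softmax(scores, axis=-1)   # each row sums to 1
    # Residual connection (an assumption): add attended priors to the queries.
    return queries + weights @ v, weights


# Hypothetical sizes: 4 task tokens, 16 prior tokens, 8 channels.
rng = np.random.default_rng(0)
d = 8
opd_feats = rng.normal(size=(4, d))
prior_feats = rng.normal(size=(16, d))
w_q, w_k, w_v = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))
fused, attn = cross_attention(opd_feats, prior_feats, w_q, w_k, w_v)
```

The fused output keeps the shape of the task features, so a prior can be injected without changing the layers that follow.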

Paper Structure

This paper contains 16 sections, 7 figures, 3 tables.

Figures (7)

  • Figure 1: Comparison of network architecture between our framework and MultiOPD. The outputs are the results on an in-the-wild image. Our model achieves superior performance and showcases generalization capabilities to unseen scenarios.
  • Figure 2: The overall architecture for MOPD. The top side shows the overall network while the bottom shows the decoder in detail. The model employs three encoders to extract the features from the images. The pixel-level embeddings from the encoder are passed to the transformer decoder with learnable part queries to learn embeddings that are used to predict the openable part. The OPD feature and perceptual grouping feature are successively crossed in the segmentation decoder to obtain a high-resolution mask. In the same way, the OPD feature and geometric feature are used in the motion decoder. The motion type, part type, and mask are predicted in all FFN layers of the semantic segmentation decoder, while the object poses, origin, and axis are predicted in all FFN layers of the motion decoder. The image on the far right shows the output. The GT axis is in blue and the predicted axis is in red.
  • Figure 3: Qualitative results on the OPDMulti and MOPD val split. The first two rows compare MOPD variants with OPDMulti on the validation set. The last rows show a comparison in the wild. The GT axis is in blue; the predicted axis is in green if it is within $5^{\circ}$ of the GT, orange if between $5^{\circ}$ and $10^{\circ}$, and red if the angle difference is greater than $10^{\circ}$.
  • Figure 4: At the top, there is a comparison with the results obtained from MOPD without the perceptual grouping encoder. At the bottom, there is the output from EfficientSAM, which we utilized to pre-train the encoder. The figure demonstrates that our model indeed utilizes the pre-trained encoder. Since DETR is a query-based model, it can occasionally detect two distinct objects as a single entity. However, by leveraging the segmentation capabilities inherent in the EfficientSAM model, we are able to effectively mitigate such errors and improve the overall accuracy of detections. The quantitative results are shown in Table \ref{table:efficiency1}.
  • Figure 5: At the top, there is a comparison with the results obtained from MOPD without the geometric encoder. At the bottom, there is the output from DSINE, which we utilized to pre-train the geometric encoder. By plugging in geometric features, the model corrects the axis direction to bring it closer to the surface normal estimate. This indicates that the decoder indeed takes advantage of geometric features. When comparing the two models, we can observe that our model predicts the origin more precisely, especially when the two models have a similar axis prediction. This is because the RGB image lacks information about the angle between the two intersecting surfaces, which makes the model unable to provide an accurate prediction of three-dimensional coordinates when the axis is near the edge of the door or lid. The introduction of normal features can alleviate this issue by pushing the origin away from the incorrect surface.
  • ...and 2 more figures
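
The color coding described in the Figure 3 caption can be sketched as a small Python helper. This is an illustration under stated assumptions, not the authors' evaluation code: axes are treated as directed 3D vectors (sign matters), and the boundary values of exactly 5 and 10 degrees are assigned to the lower bin, a choice the caption leaves open. The function names are hypothetical.

```python
import math


def axis_angle_error_deg(pred, gt):
    """Angle in degrees between a predicted and a ground-truth axis."""
    dot = sum(p * g for p, g in zip(pred, gt))
    norm = (math.sqrt(sum(p * p for p in pred))
            * math.sqrt(sum(g * g for g in gt)))
    if norm == 0.0:
        raise ValueError("axis vectors must be non-zero")
    # Clamp to [-1, 1] to guard against floating-point rounding.
    cos = max(-1.0, min(1.0, dot / norm))
    return math.degrees(math.acos(cos))


def axis_color(pred, gt):
    """Map the angular error to the color scheme of the Figure 3 caption."""
    err = axis_angle_error_deg(pred, gt)
    if err <= 5.0:
        return "green"
    if err <= 10.0:
        return "orange"
    return "red"


# A prediction tilted 7 degrees away from a vertical ground-truth axis.
tilt = math.radians(7.0)
example_color = axis_color((math.sin(tilt), 0.0, math.cos(tilt)),
                           (0.0, 0.0, 1.0))
```

If the evaluation treats an axis and its negation as equivalent, taking the absolute value of the cosine before `acos` would be the corresponding change.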