Locate n' Rotate: Two-stage Openable Part Detection with Foundation Model Priors
Siqi Li, Xiaoxue Chen, Haoyu Cheng, Guyue Zhou, Hao Zhao, Guanzhong Tian
TL;DR
This work addresses Openable Part Detection (OPD) from single-view RGB images, a task that requires both object-category understanding and motion parameter estimation. It introduces MOPD, a two-stage Transformer framework that injects perceptual grouping priors and geometric priors through separate encoders and cross-attention fusion, enabling better generalization to unseen objects. Training uses Optimal Transport Assignment (OTA) with motion-aware costs and a loss $L$ that combines segmentation and motion terms: $L = L_{seg} + L_{mot}$, where $L_{seg} = \lambda_{ce} L_{ce} + \lambda_{dice} L_{dice} + \lambda_{cls} L_{cls} + \lambda_{c} L_c$ and $L_{mot} = \lambda_{a} L_a + \lambda_{o} L_o + \lambda_{pose} L_{pose}$. Empirical results on OPDMulti show that MOPD improves part-detection mAP by up to 4.9% and motion-parameter accuracy by about 1.2% over strong baselines. Ablations confirm the contributions of the perceptual and geometric priors and of OTA, and the approach is efficient enough for real-time robotics applications.
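As a minimal sketch of how the weighted loss above composes, the following Python snippet sums the segmentation and motion terms. The names `LossWeights` and `total_loss` are hypothetical, and every weight defaults to 1.0 as a placeholder only, since the summary does not state the actual values.

```python
from dataclasses import dataclass


@dataclass
class LossWeights:
    # Placeholder values (assumption): the source does not give the weights.
    ce: float = 1.0    # weight on L_ce
    dice: float = 1.0  # weight on L_dice
    cls: float = 1.0   # weight on L_cls
    c: float = 1.0     # weight on L_c
    a: float = 1.0     # weight on L_a
    o: float = 1.0     # weight on L_o
    pose: float = 1.0  # weight on L_pose


def total_loss(terms: dict, w: LossWeights) -> float:
    """Return L = L_seg + L_mot from already-computed per-term loss values."""
    l_seg = (w.ce * terms["ce"] + w.dice * terms["dice"]
             + w.cls * terms["cls"] + w.c * terms["c"])
    l_mot = w.a * terms["a"] + w.o * terms["o"] + w.pose * terms["pose"]
    return l_seg + l_mot


# Illustrative per-term values only; these are not results from the paper.
example_terms = {"ce": 0.5, "dice": 0.25, "cls": 1.0, "c": 0.25,
                 "a": 0.5, "o": 0.5, "pose": 1.0}
example_total = total_loss(example_terms, LossWeights())
```

Keeping the weights in one structure makes it easy to zero out a term when reproducing an ablation-style comparison.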
Abstract
Detecting the openable parts of articulated objects is crucial for downstream applications in intelligent robotics, such as pulling a drawer. The task is a multi-task challenge because it requires understanding both object categories and motion. Most existing methods are either category-specific or trained on specific datasets, and so generalize poorly to unseen environments and objects. In this paper, we propose a Transformer-based Openable Part Detection (OPD) framework named Multi-feature Openable Part Detection (MOPD) that incorporates perceptual grouping and geometric priors. In the first stage of the framework, we introduce a perceptual grouping feature model that provides perceptual grouping feature priors for openable part detection, enhancing detection results through a cross-attention mechanism. In the second stage, a geometric understanding feature model offers geometric feature priors for predicting motion parameters. Compared to existing methods, our proposed approach shows better performance in both detection and motion parameter prediction. Code and models are publicly available at https://github.com/lisiqi-zju/MOPD
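To illustrate the kind of cross-attention fusion the abstract refers to, here is a minimal single-head sketch in plain Python. It is not the MOPD implementation: it assumes no learned projections, uses the prior features as both keys and values, and adds the attended result back to the queries as a residual. The function names are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, prior_feats):
    """Fuse prior features into query features with scaled dot-product attention.

    queries:     list of d-dimensional vectors (e.g. detection features)
    prior_feats: list of d-dimensional vectors (e.g. grouping-prior features),
                 used here as both keys and values (a simplifying assumption)
    """
    d = len(queries[0])
    scale = 1.0 / math.sqrt(d)
    fused = []
    for q in queries:
        # Similarity of this query to every prior feature.
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in prior_feats]
        weights = softmax(scores)
        # Attention-weighted average of the prior features.
        attended = [sum(w * v[j] for w, v in zip(weights, prior_feats))
                    for j in range(d)]
        # Residual connection: keep the original query and add the prior signal.
        fused.append([qi + ai for qi, ai in zip(q, attended)])
    return fused


# A zero query attends uniformly, so it receives the mean of the prior features.
example_fused = cross_attention([[0.0, 0.0]], [[1.0, 0.0], [0.0, 1.0]])
```

A practical version would add learned query, key, and value projections and multiple heads; the sketch only shows how features from a separate prior encoder can modulate detection features.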
