Opening Articulated Structures in the Real World

Arjun Gupta, Michelle Zhang, Rishik Sathua, Saurabh Gupta

TL;DR

The paper tackles the problem of generalizing mobile manipulation to unseen objects in unseen environments by introducing MOSART, a modular system that combines on-board perception (APM), whole-body motion planning (SeqIK), and proprioceptive adaptation for zero-shot opening of articulated structures. Through a large-scale real-world evaluation across 13 sites and 31 objects, the authors show that a modular approach outperforms end-to-end imitation learning even when the latter is trained on 1000+ demonstrations, and they identify perception, rather than precise end-effector control, as the main bottleneck. Key contributions include the Articulation-parameter Prediction Module (APM), a two-stage RGB-D approach with 3D lifting, and a contact-based adaptation strategy that improves last-centimeter grasping. The work provides a pragmatic roadmap for system-level research in generalizable mobile manipulation and highlights concrete directions for improving perception and grasping robustness in real-world deployments.

Abstract

What does it take to build mobile manipulation systems that can competently operate on previously unseen objects in previously unseen environments? This work answers this question using opening of articulated structures as a mobile manipulation testbed. Specifically, our focus is on the end-to-end performance on this task without any privileged information, i.e., the robot starts at a location with the novel target articulated object in view, and has to approach the object and successfully open it. We first develop a system for this task, and then conduct 100+ end-to-end system tests across 13 real-world test sites. Our large-scale study reveals a number of surprising findings: a) modular systems outperform end-to-end learned systems for this task, even when the end-to-end learned systems are trained on 1000+ demonstrations, b) perception, and not precise end-effector control, is the primary bottleneck to task success, and c) state-of-the-art articulation parameter estimation models developed in isolation struggle when faced with robot-centric viewpoints. Overall, our findings highlight the limitations of developing components of the pipeline in isolation and underscore the need for system-level research, providing a pragmatic roadmap for building generalizable mobile manipulation systems. Videos, code, and models are available on the project website: https://arjung128.github.io/opening-articulated-structures/

Paper Structure (41 sections, 17 figures, 6 tables)

Figures (17)

  • Figure 1: MOSART Design. The perception module outputs 3D articulation parameters in the robot frame using RGB-D images. The robot then navigates to the target location based on articulation type. Next, we use SeqIK to find a whole-body motion plan. We execute the first robot configuration from the plan to obtain a pre-grasp pose. We then use our contact-based adaptation for improved grasping. Once the handle is grasped, we execute the rest of the plan.
  • Figure 2: Overview of the Articulation-parameter Prediction Module (APM). Given an RGB image, our modified Mask R-CNN detects articulated objects and predicts the articulation type, the handle orientation, the 2D segmentation mask, and the 2D handle keypoint. We fit a convex hull to the segmentation mask and simplify it to a quadrilateral. We fit a plane to the depth image points that lie inside the mask to estimate the surface normal. The 2D handle keypoint and quadrilateral corners are lifted to 3D using the depth image. All predictions are transformed to the robot base frame. The final output of the module includes the 3D handle coordinate and surface normal in the base coordinate frame for all articulated objects, and additionally the radius and rotation axis for cabinets.
  • Figure 3: Top-down Navigation Targets and Corrective Motions. We show the top-down navigation targets relative to the handle for each articulation type. For left-hinged cabinets, the correction is a rotation in $1^\circ$ increments. For the other objects, we extend the arm in $1\,\mathrm{cm}$ increments.
  • Figure 4: Contact Correction. (a) shows a grasping attempt without contact correction, whereas (b) shows a grasping attempt with contact correction. Due to compounding errors and shrinkage of the gripper when closing, the attempt without contact correction fails to grasp the handle, whereas our contact-based correction mechanism leads to a successful grasp.
  • Figure 5: Comparison to OPDMulti [sun2023opdmulti]. We perform a qualitative comparison to OPDMulti [sun2023opdmulti] (which appeared at 3DV 2024) on the same six images presented in Figure \ref{fig:dummy-maskrcnn_supp}. OPDMulti fails in various ways: a) segmentation masks bleed outside of the object, and b) multiple objects are merged into one. Our model produces more accurate segmentation masks (which affect the surface normal, and thus the navigation).
  • ...and 12 more figures
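
The 3D lifting described in the Figure 2 caption (back-projecting 2D predictions with depth, fitting a plane to the masked depth points, and moving everything into the robot base frame) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the pinhole intrinsics (`fx`, `fy`, `cx`, `cy`), the camera-to-base transform `T_base_cam`, the SVD-based plane fit, and all function names are assumptions chosen for illustration.

```python
import numpy as np

def backproject(u, v, depth_m, fx, fy, cx, cy):
    """Lift pixels (u, v) with depth in metres to camera-frame 3D points (pinhole model)."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    z = np.asarray(depth_m, dtype=float)
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=-1)

def fit_plane_normal(points):
    """Least-squares plane fit via SVD; returns a unit normal facing the camera origin."""
    centroid = points.mean(axis=0)
    # The right singular vector with the smallest singular value is the plane normal.
    _, _, vt = np.linalg.svd(points - centroid, full_matrices=False)
    normal = vt[-1] / np.linalg.norm(vt[-1])
    # Flip the normal if it points away from the camera (which sits at the origin).
    if np.dot(normal, centroid) > 0:
        normal = -normal
    return normal

def to_base_frame(points_cam, T_base_cam):
    """Apply a 4x4 homogeneous camera-to-base transform to (N, 3) points."""
    pts = np.atleast_2d(points_cam)
    homo = np.hstack([pts, np.ones((pts.shape[0], 1))])
    return (T_base_cam @ homo.T).T[:, :3]

# Synthetic example (all values hypothetical): a fronto-parallel door 1 m from the camera.
fx = fy = 500.0
cx, cy = 320.0, 240.0
us, vs = np.meshgrid(np.arange(300, 341), np.arange(220, 261))
mask_points = backproject(us.ravel(), vs.ravel(), np.ones(us.size), fx, fy, cx, cy)
normal_cam = fit_plane_normal(mask_points)

# Assumed camera pose: no rotation, offset 0.1 m in x and 1.2 m in z from the base.
T_base_cam = np.eye(4)
T_base_cam[:3, 3] = [0.1, 0.0, 1.2]
handle_cam = backproject(320, 240, 1.0, fx, fy, cx, cy)
handle_base = to_base_frame(handle_cam, T_base_cam)
normal_base = T_base_cam[:3, :3] @ normal_cam  # directions rotate but do not translate
```

Real depth images contain holes and outliers, so a practical version would likely filter invalid depth values or use a robust fit such as RANSAC before estimating the normal; the sketch omits this for brevity.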
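
The corrective motions in the Figure 3 caption (rotation in 1° increments for left-hinged cabinets, arm extension in 1 cm increments otherwise) suggest a simple step-until-contact loop. The sketch below is a hedged illustration only: how MOSART senses contact and when it stops are not stated in this summary, so the `in_contact` callback, the `max_steps` budget, the articulation-type strings, and the `SimulatedGripper` class are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Tuple

@dataclass
class Correction:
    kind: str    # "rotate_base" or "extend_arm" (hypothetical labels)
    step: float  # degrees for rotation, metres for extension

def correction_for(articulation_type: str) -> Correction:
    """Pick the corrective motion: rotate for left-hinged cabinets, extend the arm otherwise."""
    if articulation_type == "left_hinged":
        return Correction("rotate_base", 1.0)   # 1 degree per step
    return Correction("extend_arm", 0.01)       # 1 cm per step

def adapt_until_contact(
    articulation_type: str,
    apply_step: Callable[[Correction], None],
    in_contact: Callable[[], bool],
    max_steps: int = 20,
) -> Tuple[bool, int]:
    """Repeat the corrective step until contact is sensed or the step budget is spent."""
    corr = correction_for(articulation_type)
    steps = 0
    while not in_contact():
        if steps >= max_steps:
            return False, steps
        apply_step(corr)
        steps += 1
    return True, steps

class SimulatedGripper:
    """Toy stand-in for the robot: tracks the remaining gap between gripper and surface."""
    def __init__(self, gap_m: float):
        self.gap_m = gap_m

    def apply_step(self, corr: Correction) -> None:
        if corr.kind == "extend_arm":
            self.gap_m -= corr.step

    def in_contact(self) -> bool:
        return self.gap_m <= 0.0

# A 3.5 cm gap is closed after four 1 cm extensions in this toy simulation.
sim = SimulatedGripper(gap_m=0.035)
result = adapt_until_contact("drawer", sim.apply_step, sim.in_contact)
```

Passing the motion and sensing functions as callbacks keeps the loop independent of any particular robot API, so the same logic can be exercised in simulation before it is connected to hardware.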