RoboEXP: Action-Conditioned Scene Graph via Interactive Exploration for Robotic Manipulation

Hanxiao Jiang, Binghao Huang, Ruihai Wu, Zhuoran Li, Shubham Garg, Hooshang Nayyeri, Shenlong Wang, Yunzhu Li

TL;DR

RoboEXP tackles the challenge of interactive scene exploration by constructing an action-conditioned 3D scene graph (ACSG) that encodes both spatial structure and action-dependent relationships. The system integrates perception, memory, decision-making, and action modules powered by a Large Multimodal Model to autonomously explore and incrementally build the ACSG, enabling robust manipulation across rigid, articulated, nested, and deformable objects. Experiments in tabletop and room settings show RoboEXP outperforms GPT-4V baselines in constructing complete ACSGs and guiding downstream tasks, with strong resilience to occlusion and intervention. The ACSG provides a principled, scalable representation for planning and executing complex manipulation in unknown environments, paving the way for practical household and office robotics.

Abstract

We introduce the novel task of interactive scene exploration, wherein robots autonomously explore environments and produce an action-conditioned scene graph (ACSG) that captures the structure of the underlying environment. The ACSG accounts for both low-level information (geometry and semantics) and high-level information (action-conditioned relationships between different entities) in the scene. To this end, we present the Robotic Exploration (RoboEXP) system, which incorporates a Large Multimodal Model (LMM) and an explicit memory design to enhance the system's capabilities. The robot reasons about what to explore and how to explore it, accumulating new information through the interaction process and incrementally constructing the ACSG. Leveraging the constructed ACSG, we illustrate the effectiveness and efficiency of our RoboEXP system in facilitating a wide range of real-world manipulation tasks involving rigid, articulated, nested, and deformable objects.
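To make the two levels of information concrete, here is a minimal Python sketch of a graph that stores object nodes (semantics plus placeholder geometry) and action nodes (what an action targets and what it reveals), and that grows incrementally as objects are discovered. All class, field, and method names (`ObjectNode`, `ActionNode`, `SceneGraph`, `record_discovery`) and the cabinet/cup scene are illustrative assumptions, not the authors' implementation.

```python
from dataclasses import dataclass, field


@dataclass
class ObjectNode:
    """Low-level information about one entity: a semantic label and placeholder geometry."""
    name: str
    label: str
    position: tuple = (0.0, 0.0, 0.0)  # stand-in for real geometry (assumption)


@dataclass
class ActionNode:
    """High-level information: an action applied to an object and what it reveals."""
    name: str
    target: str
    reveals: list = field(default_factory=list)


@dataclass
class SceneGraph:
    """Hypothetical container holding both kinds of nodes."""
    objects: dict = field(default_factory=dict)
    actions: dict = field(default_factory=dict)

    def add_object(self, node):
        self.objects[node.name] = node

    def add_action(self, node):
        # An action must refer to an object that is already in the graph.
        if node.target not in self.objects:
            raise KeyError(f"unknown target object: {node.target}")
        self.actions[node.name] = node

    def record_discovery(self, action_name, node):
        """Incrementally grow the graph with an object found after an action."""
        self.add_object(node)
        self.actions[action_name].reveals.append(node.name)

    def directly_visible(self):
        """Objects that are not recorded as revealed by any action."""
        hidden = {n for a in self.actions.values() for n in a.reveals}
        return sorted(set(self.objects) - hidden)


# Hypothetical scene: a cabinet that hides a cup until it is opened.
graph = SceneGraph()
graph.add_object(ObjectNode("cabinet", "cabinet"))
graph.add_action(ActionNode("open_cabinet", target="cabinet"))
graph.record_discovery("open_cabinet", ObjectNode("cup", "cup"))
print(graph.directly_visible())
```

Keeping actions as first-class nodes, rather than as labels on object-to-object edges, is one simple way to let a single action both depend on other actions and reveal several objects.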

Paper Structure

This paper contains 29 sections, 13 figures, 5 tables, 3 algorithms.

Figures (13)

  • Figure 1: Interactive Exploration to Construct an Action-Conditioned Scene Graph (ACSG) for Robotic Manipulation. (a) Exploration: The robot autonomously explores by interacting with the environment to generate a comprehensive ACSG. This graph is used to catalog the locations and relationships of items. (b) Exploitation: Utilizing the constructed scene graph, the robot completes downstream tasks by efficiently organizing the necessary items according to the desired spatial and relational constraints.
  • Figure 2: Action-Conditioned 3D Scene Graph from Interactive Scene Exploration. We depict a scenario wherein a robot arm explores a tabletop scene containing two cabinets and a condiment obstructing the left door. (a) The robot arm actively interacts with the scene, completing the interactive scene exploration process. (b) We showcase the corresponding low-level memory in our ACSG. The small graph on the bottom-left of each visualization represents a segment of the final scene graph. (c) We present the high-level memory of our ACSG. The graph reveals that picking up the condiment serves as a precondition for opening the door, and opening the bottom drawer allows the observation of the concealed banana.
  • Figure 3: Overview of Our RoboEXP System. The system comprises four modules: (a) perception, (b) memory, (c) decision-making, and (d) action.
  • Figure 4: Visualization of Quantitative Results. (a) The action-object graph captures the change in the number of discovered objects relative to the number of actions taken. Our RoboEXP efficiently discovers all objects. (b) The error breakdown of all our quantitative experiments includes 5 task settings with 10 variations each. We categorize errors into perception, decision, action, and no-error cases. For the GPT-4V baseline, we manually assist in action execution, eliminating action errors. This serves as an upper bound for baseline performance. However, even with this enhancement, our RoboEXP consistently shows superior performance.
  • Figure 5: Qualitative Results on Different Scenarios. We visualize the interactive exploration process and the corresponding constructed ACSG. (a) This scenario involves a tabletop environment with two articulated objects, accompanied by additional items either on the table or concealed in storage space. The constructed scene graph demonstrates the success of our system in identifying all objects within the environment through a series of physical interactions. (b) This scenario includes nested objects, five Matryoshka dolls, with only the top one being directly observable. Our system autonomously decides to explore the contents through a recursive reasoning process, showcasing its ability to construct a deep ACSG. (c) This scenario involves a fabric covering a mouse, demonstrating exploration with a deformable object. Our system interacts with the fabric and successfully uncovers what lies beneath it.
  • ...and 8 more figures
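
The Figure 2 caption describes two action-conditioned relationships: picking up the condiment is a precondition for opening the door, and opening the bottom drawer reveals the banana. The sketch below shows one way such relationships could be turned into an ordered action sequence, using a depth-first walk over preconditions. The dictionary encoding, the action names, and the helpers `plan_for_action` and `plan_to_observe` are assumptions made for illustration and are not taken from the paper.

```python
# Hypothetical encoding of the Figure 2 scenario (names are illustrative).
PRECONDITIONS = {
    "pick_up_condiment": [],
    "open_left_door": ["pick_up_condiment"],
    "open_bottom_drawer": [],
}
REVEALS = {
    "open_bottom_drawer": ["banana"],
}


def plan_for_action(action, preconditions):
    """Return actions in an order that satisfies every precondition of `action`."""
    plan = []
    visiting = set()

    def visit(current):
        if current in plan:
            return
        if current in visiting:
            raise ValueError(f"cyclic precondition involving {current!r}")
        visiting.add(current)
        for required in preconditions.get(current, []):
            visit(required)
        visiting.discard(current)
        plan.append(current)

    visit(action)
    return plan


def plan_to_observe(obj, preconditions, reveals):
    """Return the actions needed before `obj` becomes observable."""
    for action, revealed in reveals.items():
        if obj in revealed:
            return plan_for_action(action, preconditions)
    return []  # no recorded action hides this object


print(plan_for_action("open_left_door", PRECONDITIONS))
print(plan_to_observe("banana", PRECONDITIONS, REVEALS))
```

Under these assumptions, the first call places `pick_up_condiment` before `open_left_door`, and the second returns only `open_bottom_drawer`, since the drawer has no recorded precondition.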