Ditto in the House: Building Articulation Models of Indoor Scenes through Interactive Perception

Cheng-Chun Hsu, Zhenyu Jiang, Yuke Zhu

TL;DR

The paper addresses building articulation models of indoor scenes by enabling a robot to interact with objects, moving beyond static scene reconstruction to infer kinematic properties. It introduces Ditto in the House, a pipeline that jointly learns visual affordances and articulation models through interactive perception and an iterative refinement loop, leveraging simulation data from CubiCasa5K with the iGibson framework and real-world kitchen demonstrations. Key contributions include scene-level hotspot discovery, an articulation-aware network that uses before/after observations with contact heatmaps, and ablation studies showing gains in discovery and accuracy, including a 40% increase in parts discovered and a 45% IoU improvement. The approach demonstrates practical applicability to real environments, suggesting scalable articulation modeling for robot manipulation in everyday indoor settings.

Abstract

Virtualizing the physical world into virtual models has been a critical technique for robot navigation and planning in the real world. To foster manipulation with articulated objects in everyday life, this work explores building articulation models of indoor scenes through a robot's purposeful interactions in these scenes. Prior work on articulation reasoning primarily focuses on siloed objects of limited categories. To extend to room-scale environments, the robot has to efficiently and effectively explore a large-scale 3D space, locate articulated objects, and infer their articulations. We introduce an interactive perception approach to this task. Our approach, named Ditto in the House, discovers possible articulated objects through affordance prediction, interacts with these objects to produce articulated motions, and infers the articulation properties from the visual observations before and after each interaction. It tightly couples affordance prediction and articulation inference to improve both tasks. We demonstrate the effectiveness of our approach in both simulation and real-world scenes. Code and additional results are available at https://ut-austin-rpl.github.io/HouseDitto/

Paper Structure (19 sections, 5 figures, 2 tables)

Figures (5)

  • Figure 1: Building scene-level articulation models through interactive perception. From an initial observation of an indoor scene, our approach infers the interaction hotspots, guiding the robot to interact with articulated objects. After that, the robot collects the observations before and after the interactions. Based on the observed articulated motions, it builds the articulation models of individual objects and aggregates them into a scene-level articulation model.
  • Figure 2: Overview of model components. Our approach consists of two stages: affordance prediction and articulation inference. During affordance prediction, we pass the static scene point cloud into the affordance network and predict the scene-level affordance map. By applying point non-maximum suppression (NMS), we extract the interaction hotspots from the affordance map. Then, the robot interacts with the object based on those contact points. During articulation inference, we feed the point cloud observations before and after each interaction into the articulation model network to obtain articulation estimation. By aggregating the estimated articulation models, we build the articulation models of the entire scene.
  • Figure 3: Iterative refinement of affordance and articulation. In the initial stage, the object is partially opened due to the imprecise affordance prediction, which results in an inaccurate articulation estimation. In the refining stage, we refine object affordance based on the previous articulation estimation. The consequent new interaction fully opens the object and reveals the interior surface, thus improving the articulation estimation.
  • Figure 4: Qualitative results on virtual scenes. Static parts are colored grey, and mobile parts green. The estimated joints are shown as red arrows.
  • Figure 5: Reconstructed articulation model of the real scene. Static parts are colored grey, and mobile parts are green. The estimated joints are visualized with blue arrows. More results on the same scene are shown in Figure 1.
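
The Figure 2 caption mentions extracting interaction hotspots from the scene-level affordance map with point non-maximum suppression (NMS). The captions do not give the details of that step, so the following is only a minimal sketch of a common greedy variant: keep the highest-scoring point, suppress every candidate within a fixed radius of it, and repeat. The function name `point_nms` and the `radius` and `score_threshold` parameters are assumptions made for illustration, not the authors' implementation or settings.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def point_nms(
    points: Sequence[Point],
    scores: Sequence[float],
    radius: float,
    score_threshold: float = 0.5,
) -> List[int]:
    """Greedy point NMS (hypothetical helper, not the paper's code).

    Returns the indices of the points kept as hotspots, ordered from the
    highest affordance score to the lowest.
    """
    if len(points) != len(scores):
        raise ValueError("points and scores must have the same length")

    # Keep only candidates above the (assumed) score threshold and visit
    # them from the most to the least confident.
    order = sorted(
        (i for i, s in enumerate(scores) if s >= score_threshold),
        key=lambda i: scores[i],
        reverse=True,
    )

    kept: List[int] = []
    for i in order:
        # A candidate survives only if it lies farther than `radius`
        # from every hotspot that has already been kept.
        if all(math.dist(points[i], points[j]) > radius for j in kept):
            kept.append(i)
    return kept


# Toy affordance map: two tight clusters plus one low-score point.
toy_points = [(0.0, 0.0, 0.0), (0.05, 0.0, 0.0), (1.0, 0.0, 0.0),
              (1.02, 0.0, 0.0), (2.0, 0.0, 0.0)]
toy_scores = [0.9, 0.8, 0.7, 0.95, 0.2]
hotspots = point_nms(toy_points, toy_scores, radius=0.1)
print(hotspots)  # one index per cluster: [3, 0]
```

In this toy example each cluster collapses to its single best-scoring point, and the isolated low-score point is dropped by the threshold. This version compares every candidate against every kept hotspot, which is fine for a sketch; a spatial index such as a k-d tree would be the usual choice for a room-scale point cloud.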