Visual Room Rearrangement

Luca Weihs, Matt Deitke, Aniruddha Kembhavi, Roozbeh Mottaghi

TL;DR

RoomR introduces Room Rearrangement, a two-stage interactive benchmark: an agent first observes a room in its goal configuration and then, after objects have been moved or had their state changed, must restore that configuration. The RoomR dataset (6,000 rearrangements across 120 AI2-THOR rooms and 72 object types) supports two task variants (1-Phase and 2-Phase), and the baselines use a dual-policy model with non-parametric and semantic mapping components. Baselines trained with DD-PPO and imitation learning reveal significant challenges: semantic mapping provides notable gains, but performance remains far from perfect, underscoring the need for new architectures for comparative mapping and long-horizon visual reasoning. Overall, RoomR offers a rich testbed for navigation, manipulation, and planning in visually complex, partially observable environments, pushing beyond static-perception benchmarks.

Abstract

There has been significant recent progress in the field of Embodied AI, with researchers developing models and algorithms enabling embodied agents to navigate and interact within completely unseen environments. In this paper, we propose a new dataset and baseline models for the task of Rearrangement. We particularly focus on the task of Room Rearrangement: an agent begins by exploring a room and recording objects' initial configurations. We then remove the agent and change the poses and states (e.g., open/closed) of some objects in the room. The agent must restore the initial configurations of all objects in the room. Our dataset, named RoomR, includes 6,000 distinct rearrangement settings involving 72 different object types in 120 scenes. Our experiments show that solving this challenging interactive task, which involves navigation and object interaction, is beyond the capabilities of the current state-of-the-art techniques for embodied tasks, and we are still very far from achieving perfect performance on these types of tasks. The code and the dataset are available at: https://ai2thor.allenai.org/rearrangement
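As a rough illustration of the task described in the abstract, the toy sketch below compares the object states recorded during the walkthrough with the states found after shuffling and lists the objects that would need to be restored. Everything here is an assumption for illustration only: the `ObjectState` class, the `changed_objects` helper, and the tolerances `pos_tol` and `open_tol` are hypothetical and are not the RoomR API or the paper's evaluation criteria.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class ObjectState:
    """Hypothetical record of one object's configuration."""
    position: Tuple[float, float, float]  # (x, y, z) in meters
    openness: float = 0.0  # 0.0 = closed, 1.0 = fully open


def changed_objects(
    goal: Dict[str, ObjectState],
    current: Dict[str, ObjectState],
    pos_tol: float = 0.05,   # assumed tolerance, not from the paper
    open_tol: float = 0.1,   # assumed tolerance, not from the paper
) -> List[str]:
    """Return the names of objects whose pose or openness differs from the goal."""
    changed = []
    for name, g in goal.items():
        c = current[name]
        # An object counts as moved if any coordinate differs by more than pos_tol.
        moved = max(abs(a - b) for a, b in zip(g.position, c.position)) > pos_tol
        # An object counts as toggled if its openness differs by more than open_tol.
        toggled = abs(g.openness - c.openness) > open_tol
        if moved or toggled:
            changed.append(name)
    return sorted(changed)


# States recorded during the walkthrough (the configuration to restore).
goal = {
    "Mug": ObjectState((1.0, 0.9, 2.0)),
    "Drawer": ObjectState((0.5, 0.4, 1.0), openness=0.0),
    "Chair": ObjectState((2.0, 0.0, 3.0)),
}

# States after the agent was removed and the room was shuffled.
shuffled = {
    "Mug": ObjectState((2.5, 0.9, 0.4)),                    # moved
    "Drawer": ObjectState((0.5, 0.4, 1.0), openness=1.0),  # opened
    "Chair": ObjectState((2.0, 0.0, 3.0)),                  # unchanged
}

print(changed_objects(goal, shuffled))  # ['Drawer', 'Mug']
```

In the actual task the agent has no direct access to such ground-truth states and must infer the differences from its visual observations, which is a large part of what makes the benchmark difficult.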

Paper Structure

This paper contains 21 sections, 7 figures, 3 tables.

Figures (7)

  • Figure 1: An instance of the Room Rearrangement task. Objects begin in the positions indicated by the solid 3D bounding boxes. An agent must walk through the room and record the objects it sees. The agent is then removed, and objects are moved to the locations indicated by the dashed bounding boxes. The agent is then reintroduced into the room and must interact with objects (moving or opening them) to return the room to its original state.
  • Figure 2: Distance distribution. The distributions of horizontal (Manhattan) and vertical distances between the goal and initial positions of changed objects.
  • Figure 3: Distribution of object size. For a particular room, each column contains the cube root of the bounding box volume of every object that may change in openness (red) or position (blue). Notice that, across room categories, objects that change in position are significantly smaller than objects that change in openness.
  • Figure 4: Model overview. The model is used for both the unshuffle and walkthrough stages. The connections specific to the walkthrough and unshuffle stages are shown in blue and red, respectively. The dashed lines represent connections from the previous time step. The model's trainable parameters, inputs and outputs, and intermediate features are shown in yellow, pink, and blue, respectively.
  • Figure 5: Performance over training. The (training-set) performance of our models over ${\sim}75$ million training steps. We report the #Changed and %FixedStrict metrics; the shown values and 95% error bars are generated using locally weighted scatterplot smoothing. Notice that the PPO models quickly saturate, suggesting that they become stuck in local optima. The IL models continue to improve throughout training, although Tab. \ref{tab:results} suggests that these models begin to overfit on the training scenes.
  • ...and 2 more figures
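
The quantities named in the captions of Figures 2 and 3 can be sketched as follows. This is a minimal illustration, not the authors' plotting code: the function names are hypothetical, and it assumes positions are `(x, y, z)` tuples in meters with `y` as the vertical axis.

```python
from typing import Tuple

Vec3 = Tuple[float, float, float]


def horizontal_manhattan(a: Vec3, b: Vec3) -> float:
    """Manhattan distance in the floor plane (assumes y is the vertical axis)."""
    return abs(a[0] - b[0]) + abs(a[2] - b[2])


def vertical_distance(a: Vec3, b: Vec3) -> float:
    """Absolute height difference (assumes y is the vertical axis)."""
    return abs(a[1] - b[1])


def size_measure(bbox_dims: Vec3) -> float:
    """Cube root of the bounding box volume, given its (width, height, depth)."""
    w, h, d = bbox_dims
    return (w * h * d) ** (1.0 / 3.0)


# Example with made-up values: an object moved from a shelf to a table.
initial = (0.0, 1.0, 0.0)
goal = (1.5, 0.5, -2.0)

print(horizontal_manhattan(initial, goal))  # 3.5
print(vertical_distance(initial, goal))     # 0.5
print(size_measure((0.2, 0.2, 0.2)))        # approximately 0.2
```

The cube root puts the size measure in units of length, so a cube-shaped object with 0.2 m sides has a size of roughly 0.2 regardless of how its volume scales.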