ReALFRED: An Embodied Instruction Following Benchmark in Photo-Realistic Environments

Taewoong Kim, Cheolhong Min, Byeonghwi Kim, Jinyeon Kim, Wonje Jeung, Jonghyun Choi

TL;DR

ReALFRED addresses the gap between synthetic embodied AI benchmarks and real-world deployment by providing 3D-captured, object-interactable, multi-room environments with free-form language instructions. The authors create a large-scale dataset (150 houses, 114 object types, 30,696 directives) with expert demonstrations generated by a PDDL-based planner, and evaluate multiple baselines, including sim-to-real and real-to-real transfer with GAN-based domain adaptation. Results show that state-of-the-art methods struggle with ReALFRED's realism and scale, motivating new approaches and highlighting the importance of real-world-like data for robust instruction following. The benchmark and its publicly released data and code aim to accelerate progress toward deployable, language-driven robotic agents.

Abstract

Simulated virtual environments have been widely used to train robotic agents that perform daily household tasks. These environments have greatly advanced research, but they often provide limited object interactability, a visual appearance that differs from real-world environments, or relatively small environment sizes. This prevents models learned in virtual scenes from being readily deployable. To bridge the gap between these learning environments and deployment (i.e., real) environments, we propose the ReALFRED benchmark, which employs real-world scenes, objects, and room layouts to train agents to complete household tasks by understanding free-form language instructions and interacting with objects in large, multi-room, 3D-captured scenes. Specifically, we extend the ALFRED benchmark with updates for larger environmental spaces with smaller visual domain gaps. With ReALFRED, we analyze methods previously crafted for the ALFRED benchmark and observe that they consistently yield lower performance in all metrics, encouraging the community to develop methods in more realistic environments. Our code and data are publicly available.

Paper Structure

This paper contains 26 sections, 21 figures, and 8 tables.

Figures (21)

  • Figure 1: Proposed ReALFRED benchmark. The top image provides a perspective view of one of our scenes. The images below represent third-person views at each time step, along with their corresponding descriptions, for better understanding. The agent is required to understand instructions in natural language and then complete the desired tasks by navigating large 3D-captured environments and interacting with objects.
  • Figure 2: While other benchmarks [chang2017matterport3d, ramakrishnan2021hm3d, xia2018gibson, ramrakhya2022habitat, chen2019touchdown, krantz2020navgraph, weihs2021visual, ehsani2021manipulathor, szot2021habitat, shridhar2020alfred, padmakumar2022teach, macmahon2006walk, misra2018mapping, li2022behavior] provide one or two aspects, our proposed ReALFRED benchmark addresses all of these aspects.
  • Figure 3: Top-down view of 3D-captured environments. We provide two examples from our scanned indoor environments. White circles denote where scanners are deployed. By scanning scenes from diverse points, we avoid blind spots.
  • Figure 4: Distribution of navigable and floor areas in interactive benchmarks by scene. 'Floor area' denotes the overall size of the scene. 'Navigable area' denotes the size of the space in which the agent can actually navigate. For both metrics, the ReALFRED benchmark exhibits a more even distribution and provides larger areas. For (a), we exclude RoboTHOR since it consists of single-sized floors.
  • Figure 5: Distribution of the seven task types in ReALFRED. We provide $37.6\%$ more tasks in the valid sets and $19.3\%$ more in total than the previous benchmark [shridhar2020alfred].
  • ...and 16 more figures