EMMOE: A Comprehensive Benchmark for Embodied Mobile Manipulation in Open Environments

Dongping Li, Tielong Cai, Tianci Tang, Wenhao Chai, Katherine Rose Driggs-Campbell, Gaoang Wang

TL;DR

EMMOE introduces a unified benchmark for embodied mobile manipulation in open environments, addressing the fragmentation of prior tasks by integrating high-level planning with low-level execution (LLE). It provides EMMOE-100, a 100-task dataset with 966 subtasks, rich annotations, and re-planning data, plus SFT and DPO subdatasets to align LMMs with embodied tasks. The authors present HomieBot, a hierarchical system that combines Video-LLaVA-based planning with lightweight LLE models and error detection, and evaluate it against multiple baselines using three new metrics: Task Progress (TP), Success End Rate (SER), and Success Re-plan Rate. In the experiments, DPO-augmented HomieBot achieves the best success rate (SR), TP, SER, and path-length-weighted success rate (PLWSR), while the analyses reveal grounding challenges and LLE limitations that affect end-to-end performance. Although evaluated only in simulation, EMMOE offers a scalable framework for future real-world benchmarks and task expansions; broader impacts and limitations are also discussed.

Abstract

Developing autonomous home robots controlled by natural language has long been a pursuit of humanity. While advancements in large language models (LLMs) and embodied intelligence bring this goal closer, several challenges persist: the lack of a unified benchmark for more complex robot tasks, limited evaluation methods and metrics, and data incompatibility between LLMs and mobile manipulation trajectories. To address these issues, we propose Embodied Mobile Manipulation in Open Environments (EMMOE), a benchmark that requires agents to interpret user instructions and execute long-horizon everyday tasks in continuous space. EMMOE seamlessly integrates high-level and low-level embodied tasks into a unified framework, along with three new metrics for more diverse assessment. Additionally, we collect EMMOE-100, which features various task attributes, detailed process annotations, re-plans after failures, and two sub-datasets for LLM training. Furthermore, we design HomieBot, a sophisticated agent system consisting of an LLM with Direct Preference Optimization (DPO), lightweight navigation and manipulation models, and multiple error detection mechanisms. Finally, we demonstrate HomieBot's performance and evaluate different models and policies.

Paper Structure

This paper contains 69 sections, 5 equations, 7 figures, 12 tables.

Figures (7)

  • Figure 1: Data example in EMMOE-100. A key feature of EMMOE-100 is the emphasis on the reasoning process and interleaved execution. In the shown task, the agent must check the fridge first. Otherwise, even if the agent finally gets a banana in the kitchen, the task will not be considered a success.
  • Figure 2: Overview of HomieBot. HomieBot leverages a hierarchical framework to handle long-horizon tasks: High-Level Planning decomposes tasks into manageable actions, while Low-Level Execution carries out the received actions and provides real-time feedback.
  • Figure 3: Error Statistics. The left and right figures depict the proportion of each error type for each model in successful and failed trajectories, respectively. Additionally, we indicate the proportion of total execution failures next to each model's name. Because Qwen2-VL and MiniCPM-V 2.6 have too few successful trajectories, their results are not shown in the left figure. The full statistics as raw counts are available in Appendix \ref{sec:supp_exp_results}.
  • Figure B1: Data collection interface in Habitat-lab v0.2.3. The third-person observation on the left is used to facilitate data collection; only the first-person observation on the right, at 256$\times$256 resolution, is saved.
  • Figure B2: Dataset Statistics
  • ...and 2 more figures