EMMOE: A Comprehensive Benchmark for Embodied Mobile Manipulation in Open Environments
Dongping Li, Tielong Cai, Tianci Tang, Wenhao Chai, Katherine Rose Driggs-Campbell, Gaoang Wang
TL;DR
EMMOE introduces a unified benchmark for embodied mobile manipulation in open environments, addressing the fragmentation of prior tasks by integrating high-level planning with low-level execution (LLE). It provides EMMOE-100, a 100-task dataset with 966 subtasks, rich annotations, and re-planning data, plus SFT and DPO subdatasets to align LMMs with embodied tasks. The authors present HomieBot, a hierarchical system combining Video-LLaVA-based planning and lightweight LLE models with robust error detection, and validate it against multiple baselines using three novel metrics: Task Progress (TP), Success End Rate (SER), and Success Re-plan Rate. Experimental results show that DPO-augmented HomieBot achieves superior SR, TP, SER, and PLWSR, while analyses reveal grounding challenges and LLE limitations that affect end-to-end performance. Although evaluation is limited to simulation, EMMOE offers a scalable framework for future real-world benchmarks and task expansions; broader impacts and limitations are also discussed.
Abstract
Developing autonomous home robots controlled by natural language has long been a pursuit of humanity. While advancements in large language models (LLMs) and embodied intelligence bring this goal closer, several challenges persist: the lack of a unified benchmark for more complex robot tasks, limited evaluation methods and metrics, and data incompatibility between LLMs and mobile manipulation trajectories. To address these issues, we propose Embodied Mobile Manipulation in Open Environments (EMMOE), a benchmark that requires agents to interpret user instructions and execute long-horizon everyday tasks in continuous space. EMMOE seamlessly integrates high-level and low-level embodied tasks into a unified framework and introduces three new metrics for more diverse assessment. Additionally, we collect EMMOE-100, a dataset that features various task attributes, detailed process annotations, re-plans after failures, and two sub-datasets for LLM training. Furthermore, we design HomieBot, a sophisticated agent system consisting of an LLM trained with Direct Preference Optimization (DPO), lightweight navigation and manipulation models, and multiple error detection mechanisms. Finally, we report HomieBot's performance and evaluate different models and policies.
