BEHAVIOR: Benchmark for Everyday Household Activities in Virtual, Interactive, and Ecological Environments
Sanjana Srivastava, Chengshu Li, Michael Lingelbach, Roberto Martín-Martín, Fei Xia, Kent Vainio, Zheng Lian, Cem Gokmen, Shyamal Buch, C. Karen Liu, Silvio Savarese, Hyowon Gweon, Jiajun Wu, Li Fei-Fei
TL;DR
BEHAVIOR is a benchmark for embodied AI built around 100 realistic, diverse, long-horizon household activities. It introduces the BEHAVIOR Domain Definition Language (BDDL) for declaratively specifying initial and goal conditions, a simulator-agnostic framework instantiated in iGibson 2.0 with rich object datasets, and a suite of evaluation metrics comprising a human-grounded primary score (Success Score Q) and several efficiency measures. A corpus of 500 VR human demonstrations serves as ground truth for measuring progress and as data for imitation learning. Preliminary RL experiments show how difficult the activities are and point to the need for hierarchical planning and realistic actuation. The work aims to calibrate and accelerate the development of robust, generalizable embodied AI that can perform complex daily activities, with potential for real-world transfer and room for future benchmark and methodology improvements.
Abstract
We introduce BEHAVIOR, a benchmark for embodied AI with 100 activities in simulation, spanning a range of everyday household chores such as cleaning, maintenance, and food preparation. These activities are designed to be realistic, diverse, and complex, aiming to reproduce the challenges that agents must face in the real world. Building such a benchmark poses three fundamental difficulties for each activity: definition (it can differ by time, place, or person), instantiation in a simulator, and evaluation. BEHAVIOR addresses these with three innovations. First, we propose an object-centric, predicate logic-based description language for expressing an activity's initial and goal conditions, enabling generation of diverse instances for any activity. Second, we identify the simulator-agnostic features required by an underlying environment to support BEHAVIOR, and demonstrate their realization in one such simulator. Third, we introduce a set of metrics to measure task progress and efficiency, both absolute and relative to human demonstrators. We include 500 human demonstrations in virtual reality (VR) to serve as the human ground truth. Our experiments demonstrate that even state-of-the-art embodied AI solutions struggle with the level of realism, diversity, and complexity imposed by the activities in our benchmark. We make BEHAVIOR publicly available at behavior.stanford.edu to facilitate and calibrate the development of new embodied AI solutions.
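To make the idea of an object-centric, predicate-based goal condition more concrete, the following is a minimal Python sketch. It is not BDDL syntax and not the benchmark's implementation: the predicate names (`ontop`, `dusty`, `inside`, `open`), the object names, and the helper functions `holds` and `goal_fraction` are all hypothetical, and the goal is assumed to be a flat conjunction of ground literals. The fraction of satisfied literals is used here only as a simplified stand-in for a progress measure.

```python
from typing import Dict, List, Tuple

# A ground atom is a predicate name applied to concrete objects.
Atom = Tuple[str, Tuple[str, ...]]
# A literal is an atom together with the truth value the goal wants.
Literal = Tuple[str, Tuple[str, ...], bool]
# A state maps atoms to their current truth value; missing atoms are False.
State = Dict[Atom, bool]


def holds(state: State, literal: Literal) -> bool:
    """Return True if the literal's desired truth value matches the state."""
    name, args, desired = literal
    return state.get((name, args), False) == desired


def goal_fraction(state: State, goal: List[Literal]) -> float:
    """Fraction of goal literals satisfied (assumed stand-in for progress)."""
    if not goal:
        return 1.0  # an empty goal is trivially satisfied
    return sum(holds(state, lit) for lit in goal) / len(goal)


# Hypothetical goal: plate on the table, table not dusty,
# apple inside the fridge, fridge closed.
goal: List[Literal] = [
    ("ontop", ("plate_1", "table_1"), True),
    ("dusty", ("table_1",), False),
    ("inside", ("apple_1", "fridge_1"), True),
    ("open", ("fridge_1",), False),
]

# Hypothetical current state: two of the four literals are satisfied.
state: State = {
    ("ontop", ("plate_1", "table_1")): True,
    ("dusty", ("table_1",)): True,
    ("inside", ("apple_1", "fridge_1")): True,
    ("open", ("fridge_1",)): True,
}

print(goal_fraction(state, goal))  # 0.5
```

Because the goal is stated over object relations and properties rather than over a fixed scene layout, the same goal can in principle be checked in any scene that contains suitable objects, which is the property the abstract relies on for generating diverse instances of an activity.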
