BEHAVIOR: Benchmark for Everyday Household Activities in Virtual, Interactive, and Ecological Environments

Sanjana Srivastava, Chengshu Li, Michael Lingelbach, Roberto Martín-Martín, Fei Xia, Kent Vainio, Zheng Lian, Cem Gokmen, Shyamal Buch, C. Karen Liu, Silvio Savarese, Hyowon Gweon, Jiajun Wu, Li Fei-Fei

TL;DR

BEHAVIOR presents a comprehensive benchmark for embodied AI focused on 100 realistic, diverse, and long-horizon household activities. It introduces the BEHAVIOR Domain Definition Language (BDDL) to declaratively specify initial and goal conditions, a simulator-agnostic framework instantiated in iGibson 2.0 with rich object datasets, and a suite of evaluation metrics including a human-grounded primary score (Success Score Q) and multiple efficiency measures. A corpus of 500 VR human demonstrations is provided to serve as ground truth for progress and imitation learning, and preliminary RL experiments demonstrate the difficulty of BEHAVIOR and the need for hierarchical planning and realistic actuation. The work aims to calibrate and accelerate development of robust, generalizable embodied AI capable of performing complex daily activities, with potential for real-world transfer and open opportunities for future benchmarks and methodology improvements.

Abstract

We introduce BEHAVIOR, a benchmark for embodied AI with 100 activities in simulation, spanning a range of everyday household chores such as cleaning, maintenance, and food preparation. These activities are designed to be realistic, diverse, and complex, aiming to reproduce the challenges that agents must face in the real world. Building such a benchmark poses three fundamental difficulties for each activity: definition (it can differ by time, place, or person), instantiation in a simulator, and evaluation. BEHAVIOR addresses these with three innovations. First, we propose an object-centric, predicate logic-based description language for expressing an activity's initial and goal conditions, enabling generation of diverse instances for any activity. Second, we identify the simulator-agnostic features required by an underlying environment to support BEHAVIOR, and demonstrate its realization in one such simulator. Third, we introduce a set of metrics to measure task progress and efficiency, absolute and relative to human demonstrators. We include 500 human demonstrations in virtual reality (VR) to serve as the human ground truth. Our experiments demonstrate that even state-of-the-art embodied AI solutions struggle with the level of realism, diversity, and complexity imposed by the activities in our benchmark. We make BEHAVIOR publicly available at behavior.stanford.edu to facilitate and calibrate the development of new embodied AI solutions.
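To make the idea of a predicate-logic goal condition and a progress metric concrete, the following is a minimal sketch in Python rather than BDDL itself. It assumes, for illustration only, that a goal is a flat conjunction of ground literals and that progress is the fraction of those literals currently satisfied; the names `Literal` and `success_score`, the predicates, and the objects are all hypothetical, and the benchmark's actual definition of the success score may differ (for example, in how quantified or disjunctive goals are handled).

```python
from typing import FrozenSet, Tuple

# Hypothetical representation of a ground literal:
# (predicate name, object arguments), e.g. ("ontop", ("book", "shelf")).
Literal = Tuple[str, Tuple[str, ...]]


def success_score(state: FrozenSet[Literal], goal: FrozenSet[Literal]) -> float:
    """Fraction of goal literals that hold in the current symbolic state.

    Assumption: the goal is a plain conjunction of ground literals.
    """
    if not goal:
        return 1.0  # an empty goal is trivially satisfied
    satisfied = sum(1 for literal in goal if literal in state)
    return satisfied / len(goal)


# Illustrative goal: the book is on the shelf and the sock is in the drawer.
goal = frozenset({
    ("ontop", ("book", "shelf")),
    ("inside", ("sock", "drawer")),
})

# Illustrative current state: only the book has been put away.
state = frozenset({
    ("ontop", ("book", "shelf")),
    ("ontop", ("sock", "floor")),
})

score = success_score(state, goal)
print(f"partial success score: {score:.2f}")
```

A score of this kind gives partial credit for activity-relevant steps completed before the full goal holds, which is the role the abstract assigns to the task-progress metric.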

Paper Structure

This paper contains 43 sections, 20 figures, 5 tables.

Figures (20)

  • Figure 1: Benchmarking Embodied AI with BEHAVIOR: ⓐ We select 100 realistic household activities from the American Time Use Survey (ATUS) and define each with a set of relevant objects, organized with WordNet (Miller, 1995), and logic-symbolic initial and goal conditions in BDDL (see the BDDL section). ⓑ We provide an implementation of BEHAVIOR in iGibson 2.0 that uses these definitions to generate potentially infinite, diverse activity instances in realistic home scenes. ⓒ AI agents perform the activities in simulation through continuous physical interactions of an embodied avatar with the environment. Humans can perform the same activities in VR. BEHAVIOR includes a dataset of 500 successful VR demonstrations. ⓓ Changes in the scene are continuously mapped to their logic-symbolic equivalent representation in BDDL and checked against the goal condition; we provide intermediate success scores, metrics on the agent's efficiency, and a human-centric metric relative to the demonstrations.
  • Figure 2: Unary and Binary Predicates in BDDL: We represent object states and relationships to other objects based on their kinematics, temperature, wetness level and other physical and functional properties, enabling a diverse and complex set of realistic activities
  • Figure 3: Evaluation of human performance in collect_misplaced_items: (Left) success score, $Q$; (Right) efficiency metrics: kinematic disarrangement ($D_k$, dotted) and hand interaction displacement ($L_\mathit{right}$, green, and $L_\mathit{left}$, blue). Frames at the top depict significant events detected by the metrics. The success score detects the completion of activity-relevant steps; exploration, manipulation, and scene disruption events are captured by the efficiency metrics, which provide complementary information about the performance of the agent.
  • Figure A.1: BEHAVIOR 100 activities: Each pair of images depicts a frame of the execution of the activity in BEHAVIOR from the agent's perspective in virtual reality (left) and the same activity in real life from a YouTube video (right). All activities are selected from the American Time Use Survey (ATUS) and correspond to simulatable household chores relevant to humans' everyday lives. The set of activities covers common areas such as cleaning, maintenance, preparation for social activities, and household management.
  • Figure A.2: BEHAVIOR 100 activities (cont.)
  • ...and 15 more figures
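
The efficiency metrics named in the Figure 3 caption can be illustrated with a short sketch. The code below assumes, purely for illustration, that a hand displacement measure such as $L_\mathit{right}$ or $L_\mathit{left}$ accumulates the straight-line distance between consecutive recorded hand positions; the function name `path_length` and the sample trajectory are hypothetical, and the benchmark's exact definition (for example, counting only frames in which the hand is interacting with an object) is not given in this section.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float, float]  # (x, y, z) position, units assumed to be meters


def path_length(positions: Sequence[Point]) -> float:
    """Sum of Euclidean distances between consecutive positions.

    Assumption: displacement is accumulated over every recorded frame.
    """
    return sum(math.dist(a, b) for a, b in zip(positions, positions[1:]))


# Hypothetical right-hand trajectory sampled at three frames.
right_hand = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0), (3.0, 4.0, 2.0)]

print(f"accumulated right-hand displacement: {path_length(right_hand):.1f}")
```

Under this reading, a lower value for the same success score would indicate a more economical execution, which is why such measures complement the success score rather than replace it.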