VirtualHome: Simulating Household Activities via Programs

Xavier Puig, Kevin Ra, Marko Boben, Jiaman Li, Tingwu Wang, Sanja Fidler, Antonio Torralba

TL;DR

Problem: robots require explicit, executable representations of complex household tasks. Approach: collect a large knowledge base of home activities encoded as programs, build the VirtualHome simulator, and develop encoder-decoder models that translate natural language or video into programs that drive agents. Contributions: the ActivityPrograms dataset, the VirtualHome simulator with rich ground truth, methods for generating programs from text and video using RL, and a demonstration of task execution in simulation. Impact: enables scalable training and evaluation of vision+robotics systems on realistic, multi-step household activities, and provides a platform for future learning from demonstrations.

Abstract

In this paper, we are interested in modeling complex activities that occur in a typical household. We propose to use programs, i.e., sequences of atomic actions and interactions, as a high level representation of complex tasks. Programs are interesting because they provide a non-ambiguous representation of a task, and allow agents to execute them. However, nowadays, there is no database providing this type of information. Towards this goal, we first crowd-source programs for a variety of activities that happen in people's homes, via a game-like interface used for teaching kids how to code. Using the collected dataset, we show how we can learn to extract programs directly from natural language descriptions or from videos. We then implement the most common atomic (inter)actions in the Unity3D game engine, and use our programs to "drive" an artificial agent to execute tasks in a simulated household environment. Our VirtualHome simulator allows us to create a large activity video dataset with rich ground-truth, enabling training and testing of video understanding models. We further showcase examples of our agent performing tasks in our VirtualHome based on language descriptions.
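To make the idea of a program as a sequence of atomic (inter)actions concrete, here is a minimal Python sketch. The abstract does not specify a program syntax, so the line format `[Action] <object> (instance_id)` is an assumption made for illustration, and the names `Step`, `parse_step`, and `parse_program` are hypothetical helpers, not part of any released API.

```python
import re
from dataclasses import dataclass
from typing import List, Tuple

# ASSUMPTION: one step per line, written as "[Action] <object> (id) ...".
# This syntax is illustrative only; the abstract does not define it.
STEP_RE = re.compile(r"^\[(\w+)\]((?:\s*<\w+>\s*\(\d+\))*)\s*$")
ARG_RE = re.compile(r"<(\w+)>\s*\((\d+)\)")


@dataclass(frozen=True)
class Step:
    """One atomic action plus the object instances it interacts with."""
    action: str
    objects: Tuple[Tuple[str, int], ...]  # (object name, instance id)


def parse_step(line: str) -> Step:
    """Parse a single program line into a Step, or raise ValueError."""
    match = STEP_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not a valid program step: {line!r}")
    objects = tuple(
        (name, int(idx)) for name, idx in ARG_RE.findall(match.group(2))
    )
    return Step(match.group(1), objects)


def parse_program(text: str) -> List[Step]:
    """Parse a multi-line program, skipping blank lines."""
    return [parse_step(line) for line in text.splitlines() if line.strip()]


# A hypothetical "watch TV" activity written as a program.
program = parse_program("""
[Walk] <television> (1)
[SwitchOn] <television> (1)
[Walk] <sofa> (1)
[Sit] <sofa> (1)
[Watch] <television> (1)
""")

for step in program:
    print(step.action, [name for name, _ in step.objects])
```

Because each step names an action and explicit object instances, the sequence is unambiguous in the sense the abstract describes: an agent can execute it step by step without interpreting free-form language.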

Paper Structure

This paper contains 14 sections, 3 equations, 9 figures, 3 tables.

Figures (9)

  • Figure 1: We first crowdsource a large knowledge base of household tasks (top). Each task has a high-level name and a natural language instruction. We then collect "programs" for these tasks (middle left), where the annotators "translate" the instruction into simple code. We implement the most frequent (inter)actions in a 3D simulator, called VirtualHome, allowing us to drive an agent to execute tasks defined by programs. We propose methods to generate programs automatically from text (top) and video (bottom), thus driving an agent via language and a video demonstration.
  • Figure 2: VirtualHome Activity Dataset is a video dataset of composite activities created with our simulator. We start by generating programs using a simple probabilistic grammar. We animate each program in VirtualHome by randomizing the selection of homes, agents, cameras, as well as the placement of a subset of the objects, the initial location of the agent, the speed of the actions, and choice of objects for interactions. Each program is shown to an annotator who is asked to describe it in natural language (top row). Videos have ground-truth: (second row) time-stamp for each atomic action, (bottom) 2D and 3D pose, class and object instance segmentation, depth and optical flow.
  • Figure 3: a) Description provided by a worker. b) User interface showing the list of block categories and 4 example blocks, c) Example of composition of a block by adding the arguments. Each block is like a Lego piece where the user can drop arguments inside and attach one block to another. d) Final program corresponding to the description from (a).
  • Figure 4: a) Counts of actions in our ActivityPrograms dataset, b) object counts (zoom to read)
  • Figure 5: 3D households in our VirtualHome. Notice the diversity in room and object layout and appearance. Each home has on average $357$ objects. First $4$ scenes are used for training, the fifth is also used in val, and all scenes are used when testing our video-to-script model.
  • ...and 4 more figures