The MineRL BASALT Competition on Learning from Human Feedback

Rohin Shah, Cody Wild, Steven H. Wang, Neel Alex, Brandon Houghton, William Guss, Sharada Mohanty, Anssi Kanervisto, Stephanie Milani, Nicholay Topin, Pieter Abbeel, Stuart Russell, Anca Dragan

TL;DR

MineRL BASALT investigates learning from human feedback (LfHF) as a way to solve tasks that lack explicit reward functions in an open-world Minecraft setting. It introduces four tasks, each described in natural language, provides human demonstration data and a human-evaluation scoring protocol based on TrueSkill, and supplies a behavioral cloning baseline along with starter tooling. The competition aims to advance LfHF methods that support regulation of AI systems, value alignment, and safer AI deployment by emphasizing generalization and open-ended objectives. By combining an open-world benchmark, human judgments, and a protocol that lets participants use any training method, it seeks to catalyze progress toward agents that interpret designer intent and operate under human-specified constraints.

Abstract

The last decade has seen a significant increase in interest in deep learning research, with many public successes that have demonstrated its potential. As such, these systems are now being incorporated into commercial products. With this comes an additional challenge: how can we build AI systems that solve tasks where there is not a crisp, well-defined specification? While multiple solutions have been proposed, in this competition we focus on one in particular: learning from human feedback. Rather than training AI systems using a predefined reward function or using a labeled dataset with a predefined set of categories, we instead train the AI system using a learning signal derived from some form of human feedback, which can evolve over time as the understanding of the task changes, or as the capabilities of the AI system improve. The MineRL BASALT competition aims to spur forward research on this important class of techniques. We design a suite of four tasks in Minecraft for which we expect it will be hard to write down hardcoded reward functions. These tasks are defined by a paragraph of natural language: for example, "create a waterfall and take a scenic picture of it", with additional clarifying details. Participants must train a separate agent for each task, using any method they want. Agents are then evaluated by humans who have read the task description. To help participants get started, we provide a dataset of human demonstrations on each of the four tasks, as well as an imitation learning baseline that leverages these demonstrations. Our hope is that this competition will improve our ability to build AI systems that do what their designers intend them to do, even when the intent cannot be easily formalized. Besides allowing AI to solve more tasks, this can also enable more effective regulation of AI systems and help make progress on the value alignment problem.

Paper Structure

This paper contains 27 sections, 1 equation, 2 figures.

Figures (2)

  • Figure 1: An illustration of the MineRL BASALT competition procedure. We provide tasks consisting of a simple English language description alongside a Gym environment, without any associated reward function. Participants will train agents for these tasks using their preferred methods. Submitted agents will be evaluated based on how well they complete the tasks, as judged by humans given the same task descriptions.
  • Figure 2: Evaluation workflow. Human workers are first shown the description of the task as well as calibration examples of how good particular trajectories are. They are then shown two agent trajectories, and are asked which agent performed the task better. From these comparisons, we compute scores using the TrueSkill system, and compute final scores by averaging the normalized TrueSkill scores across tasks.
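
The scoring pipeline described in Figure 2 can be illustrated with a minimal sketch. It is not the competition's actual evaluation code. It assumes a simplified two-player, no-draw TrueSkill-style update written with the standard library only, the commonly used default prior (mean 25, standard deviation 25/3), and z-score normalization within each task; this section does not specify the normalization, so that choice is an assumption. The agent names, task names, and helper functions (`update`, `rate_task`, `normalize`, `final_scores`) are hypothetical.

```python
import math
from statistics import NormalDist, mean, pstdev

_STD_NORMAL = NormalDist()

# Assumed prior and performance noise (common TrueSkill defaults).
MU0 = 25.0
SIGMA0 = MU0 / 3
BETA = SIGMA0 / 2


def update(winner, loser, beta=BETA):
    """Simplified two-player, no-draw TrueSkill-style update.

    Each rating is a (mu, sigma) pair; returns the updated pairs.
    """
    (mu_w, sig_w), (mu_l, sig_l) = winner, loser
    c = math.sqrt(2 * beta**2 + sig_w**2 + sig_l**2)
    t = (mu_w - mu_l) / c
    v = _STD_NORMAL.pdf(t) / _STD_NORMAL.cdf(t)  # mean correction
    w = v * (v + t)                              # variance correction, in (0, 1)
    new_w = (mu_w + sig_w**2 / c * v,
             sig_w * math.sqrt(1 - sig_w**2 / c**2 * w))
    new_l = (mu_l - sig_l**2 / c * v,
             sig_l * math.sqrt(1 - sig_l**2 / c**2 * w))
    return new_w, new_l


def rate_task(agents, comparisons):
    """Turn (winner, loser) human judgments for one task into mean skills."""
    ratings = {a: (MU0, SIGMA0) for a in agents}
    for winner, loser in comparisons:
        ratings[winner], ratings[loser] = update(ratings[winner], ratings[loser])
    return {a: mu for a, (mu, _) in ratings.items()}


def normalize(scores):
    """Z-score normalization within a task (an assumed normalization)."""
    m = mean(scores.values())
    s = pstdev(scores.values()) or 1.0  # avoid dividing by zero if all equal
    return {a: (x - m) / s for a, x in scores.items()}


def final_scores(agents, comparisons_by_task):
    """Average the normalized per-task scores across tasks."""
    per_task = [normalize(rate_task(agents, comps))
                for comps in comparisons_by_task.values()]
    return {a: mean(task[a] for task in per_task) for a in agents}


# Hypothetical judgments: each tuple is (preferred agent, other agent).
agents = ["agent_a", "agent_b", "agent_c"]
comparisons_by_task = {
    "task_1": [("agent_a", "agent_b"), ("agent_a", "agent_c"), ("agent_b", "agent_c")],
    "task_2": [("agent_a", "agent_b"), ("agent_a", "agent_c"), ("agent_b", "agent_c")],
}
final = final_scores(agents, comparisons_by_task)
```

With these judgments, `agent_a` ends up ranked above `agent_b`, which in turn ranks above `agent_c`. Normalizing within each task before averaging keeps a task with a wide spread of raw skill estimates from dominating the final score.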