Batch Active Learning of Reward Functions from Human Preferences

Erdem Bıyık, Nima Anari, Dorsa Sadigh

TL;DR

This work tackles two challenges in learning reward functions from human preferences in robotics: data efficiency and time efficiency. It proposes batch active preference-based learning, which combines information-theoretic query selection with time-efficient batch construction, and introduces a DPP-based method that balances informativeness and diversity within each batch. In simulations and a user study across multiple robotic tasks, the DPP-based approach consistently outperforms alternative batch strategies and non-batch methods while significantly reducing query-generation time. The findings suggest that batch active preference-based learning can enable scalable, parallelizable human-in-the-loop reward learning for complex robotic systems, with practical impact on rapid customization and safety-aware behavior shaping.

Abstract

Data generation and labeling are often expensive in robot learning. Preference-based learning is a concept that enables reliable labeling by querying users with preference questions. Active querying methods are commonly employed in preference-based learning to generate more informative data at the expense of parallelization and computation time. In this paper, we develop a set of novel algorithms, batch active preference-based learning methods, that enable efficient learning of reward functions using as few data samples as possible while still having short query generation times and also retaining parallelizability. We introduce a method based on determinantal point processes (DPP) for active batch generation and several heuristic-based alternatives. Finally, we present our experimental results for a variety of robotics tasks in simulation. Our results suggest that our batch active learning algorithm requires only a few queries that are computed in a short amount of time. We showcase one of our algorithms in a study to learn human users' preferences.

Paper Structure (24 sections, 1 theorem, 31 equations, 12 figures, 2 tables, 4 algorithms)

Key Result

Theorem 1

The above algorithm finds an $e^k$-approximation of the mode.

Figures (12)

  • Figure 1: Batches should be both diverse and informative in batch active preference-based learning. Here, a hypothetical batch selection problem is visualized. Each cross represents a query. Similar queries are close to each other. Orange shows the queries selected in that iteration, and blue shows the queries for which the human responses have already been collected in the previous iterations. Green color represents informativeness: darker regions correspond to the queries with high informativeness based on the information collected until that iteration. (Top) Maximizing only informativeness generates batches that include very similar queries which, when queried together, carry redundant information. (Middle) Maximizing only diversity does not take informativeness into account at all, and so is wasteful as it selects some queries that are not informative. (Bottom) A good batch active learning algorithm should both select informative queries and avoid redundancy.
  • Figure 2: The schematic of the preference-based learning problem starting from two sample inputs $(x^0, \mathbf{u}_A)$ and $(x^0, \mathbf{u}_B)$.
  • Figure 3: Visualizations of the batch generation process of the proposed time-efficient batch active learning algorithms. In each visual, a simple 2D space with 16 different $\psi$ values that correspond to the reduced set $\mathcal{X}$ is shown. The goal is to select a batch of $k=5$ that will near-optimally maximize the joint information gain. The selected queries are shown in orange. (a) Greedy Selection. (b) Medoids Selection. The points are selected based on the $k$-medoids clustering algorithm. (c) Boundary Medoids Selection. The clusters are chosen over the boundary of the convex hull of all samples. (d) Successive Elimination. One point is selected and another is eliminated based on pairwise comparisons of mutual information.
  • Figure 4: The effect of $\alpha$ is visualized. The columns of the matrix $L$ have the same length here; however $\{1,3\}$ is a more diverse set than $\{1,2\}$. When $\alpha=1$, $\{1,3\}$ is two times more likely to be sampled from the DPP distribution than $\{1,2\}$. When we increase $\alpha$ to $2$, this ratio increases to $4$, since more diverse sets are boosted against the less diverse sets.
  • Figure 5: Simulation view of each environment. (a) Fetch, (b) Driver, (c) Tosser, (d) Lunar Lander, (e) Swimmer.
  • ...and 7 more figures
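
The arithmetic in the Figure 4 caption can be checked with a small numerical sketch. It assumes the exponentiated DPP weight $P(S) \propto \det(L_S)^{\alpha}$ and uses three hypothetical unit-length feature vectors at 0°, 45°, and 90°; these vectors are chosen for illustration only and are not taken from the figure. Under these assumptions, the set $\{1,3\}$ (orthogonal items) is twice as likely as $\{1,2\}$ at $\alpha=1$, and four times as likely at $\alpha=2$.

```python
import numpy as np

def unnormalized_dpp_weight(L, subset, alpha=1.0):
    """Return det(L_S)**alpha, the unnormalized weight of `subset`.

    Assumes the exponentiated DPP form P(S) ∝ det(L_S)^alpha.
    """
    idx = np.asarray(subset)
    return np.linalg.det(L[np.ix_(idx, idx)]) ** alpha

# Hypothetical items: unit vectors at 0, 45, and 90 degrees (equal length).
angles = np.deg2rad([0.0, 45.0, 90.0])
B = np.vstack([np.cos(angles), np.sin(angles)])  # shape (2, 3), one column per item
L = B.T @ B                                      # Gram (kernel) matrix, shape (3, 3)

def diversity_ratio(alpha):
    # Items {1, 3} and {1, 2} in the caption are indices [0, 2] and [0, 1] here.
    diverse = unnormalized_dpp_weight(L, [0, 2], alpha)
    similar = unnormalized_dpp_weight(L, [0, 1], alpha)
    return diverse / similar

ratio_alpha_1 = diversity_ratio(1.0)
ratio_alpha_2 = diversity_ratio(2.0)
print(f"alpha=1: ratio = {ratio_alpha_1:.3f}")
print(f"alpha=2: ratio = {ratio_alpha_2:.3f}")
```

Because the ratio at $\alpha=1$ is raised to the power $\alpha$, any larger $\alpha$ widens the gap between diverse and redundant sets, which matches the boosting effect the caption describes.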

Theorems & Definitions (2)

  • Theorem 1
  • proof