Batch Active Learning of Reward Functions from Human Preferences
Erdem Bıyık, Nima Anari, Dorsa Sadigh
TL;DR
This work addresses the data-efficiency and time-efficiency challenges of learning reward functions from human preferences in robotics. It proposes batch active preference-based learning, which combines information-theoretic query selection with time-efficient batch construction, and introduces a DPP-based method that balances informativeness and diversity within each batch. In simulations and a user study across multiple robotic tasks, the DPP-based approach consistently outperforms alternative batch strategies and non-batch methods while significantly reducing query-generation time. The findings suggest that batch active preference-based learning can enable scalable, parallelizable human-in-the-loop reward learning for complex robotic systems, with practical impact on rapid customization and safety-aware behavior shaping.
Abstract
Data generation and labeling are often expensive in robot learning. Preference-based learning enables reliable labeling by querying users with preference questions. Active querying methods are commonly employed in preference-based learning to generate more informative data, at the expense of parallelization and computation time. In this paper, we develop a set of novel algorithms, batch active preference-based learning methods, that enable efficient learning of reward functions using as few data samples as possible while keeping query generation times short and retaining parallelizability. We introduce a method based on determinantal point processes (DPP) for active batch generation and several heuristic-based alternatives. Finally, we present our experimental results for a variety of robotics tasks in simulation. Our results suggest that our batch active learning algorithm requires only a few queries that are computed in a short amount of time. We showcase one of our algorithms in a study to learn human users' preferences.
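To make the idea of DPP-based batch generation concrete, the following is a minimal sketch and not the authors' implementation. It assumes each candidate query has a feature vector and a nonnegative informativeness score, builds the common quality-diversity DPP kernel (score times similarity times score), and greedily picks the batch that maximizes the log-determinant of the selected submatrix. The function names, the RBF similarity, and the `length_scale` parameter are all hypothetical choices made for illustration.

```python
import numpy as np


def rbf_similarity(features, length_scale=1.0):
    """Pairwise RBF similarity between candidate queries (an assumed choice)."""
    diffs = features[:, None, :] - features[None, :, :]
    sq_dists = np.sum(diffs ** 2, axis=-1)
    return np.exp(-sq_dists / (2.0 * length_scale ** 2))


def greedy_dpp_batch(features, scores, batch_size, length_scale=1.0):
    """Greedily select a batch that approximately maximizes the DPP log-determinant.

    features: (n, d) array, one hypothetical feature vector per candidate query.
    scores:   (n,) array of nonnegative informativeness scores.
    Returns a list of selected candidate indices, in selection order.
    """
    similarity = rbf_similarity(features, length_scale)
    # Quality-diversity kernel: L_ij = score_i * similarity_ij * score_j.
    kernel = scores[:, None] * similarity * scores[None, :]

    selected = []
    for _ in range(batch_size):
        best_idx, best_logdet = None, -np.inf
        for i in range(len(scores)):
            if i in selected:
                continue
            idx = selected + [i]
            sign, logdet = np.linalg.slogdet(kernel[np.ix_(idx, idx)])
            # Skip candidates that make the submatrix numerically singular.
            if sign > 0 and logdet > best_logdet:
                best_idx, best_logdet = i, logdet
        if best_idx is None:
            break  # no remaining candidate adds positive volume
        selected.append(best_idx)
    return selected
```

With two nearly identical high-scoring candidates and one distant, slightly lower-scoring candidate, this greedy rule picks one of the near-duplicates and then the distant candidate, which is the informativeness-versus-diversity trade-off described above. The exhaustive inner loop is chosen for clarity rather than speed; faster incremental updates exist but are omitted here.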
