EnQuery: Ensemble Policies for Diverse Query-Generation in Preference Alignment of Robot Navigation

Jorge de Heuvel; Florian Seiler; Maren Bennewitz

TL;DR

EnQuery introduces a behavior-diverse ensemble of deterministic navigation policies regularized to maximize output diversity for a given state, enabling multiple plausible trajectory queries under identical task configurations. By pairing GMDR with goal-distance weighting, EnQuery generates diverse trajectories that improve information gain in low-query RLHF scenarios, while recycling prior data for efficient policy alignment. A Bradley–Terry-based reward model and a lambda-balanced alignment objective enable data-efficient preference incorporation, validated by both quantitative and qualitative analyses and a novel flow-field explainability visualization. The approach yields superior query efficiency and interpretable navigation behavior, with potential impact on human-centric robotic navigation and broader RLHF applications.
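The Bradley–Terry preference model mentioned above can be illustrated with a minimal sketch. It assumes each of two trajectories has already been assigned a scalar predicted return by a reward model; the function names `bradley_terry_prob` and `preference_loss` are hypothetical and are not taken from the EnQuery code base.

```python
import math


def bradley_terry_prob(return_a: float, return_b: float) -> float:
    """Probability that trajectory A is preferred over trajectory B.

    Bradley-Terry: P(A > B) = exp(R_A) / (exp(R_A) + exp(R_B)),
    which equals the logistic function of R_A - R_B.
    """
    diff = return_a - return_b
    # Numerically stable logistic function for either sign of diff.
    if diff >= 0:
        return 1.0 / (1.0 + math.exp(-diff))
    e = math.exp(diff)
    return e / (1.0 + e)


def preference_loss(return_a: float, return_b: float, label: float) -> float:
    """Cross-entropy loss for one query.

    label is 1.0 if the user preferred A and 0.0 if the user preferred B
    (an assumed labeling convention).
    """
    p = bradley_terry_prob(return_a, return_b)
    eps = 1e-12  # guards against log(0)
    return -(label * math.log(p + eps) + (1.0 - label) * math.log(1.0 - p + eps))
```

With equal predicted returns the model is indifferent (probability 0.5), and the loss is smaller when the label agrees with the trajectory that has the higher predicted return.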

Abstract

To align mobile robot navigation policies with user preferences through reinforcement learning from human feedback (RLHF), reliable and behavior-diverse user queries are required. However, deterministic policies fail to generate a variety of navigation trajectory suggestions for a given navigation task. In this paper, we introduce EnQuery, a query generation approach using an ensemble of policies that achieve behavioral diversity through a regularization term. For a given navigation task, EnQuery produces multiple navigation trajectory suggestions, thereby optimizing the efficiency of preference data collection with fewer queries. Our methodology demonstrates superior performance in aligning navigation policies with user preferences in low-query regimes, offering enhanced policy convergence from sparse preference queries. The evaluation is complemented with a novel explainability representation, capturing full scene navigation behavior of the mobile robot in a single plot. Our code is available online at https://github.com/hrl-bonn/EnQuery.
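As a rough illustration of how a diversity regularization term can be attached to an ensemble, the sketch below measures the spread of the actions that the ensemble members propose for one shared state and subtracts it, scaled by a weight `kappa`, from a task loss. The mean pairwise Euclidean distance used here is an assumption chosen for simplicity, not the regularizer defined in the paper, and the names `action_diversity` and `regularized_loss` are hypothetical.

```python
import math
from itertools import combinations
from typing import Sequence


def action_diversity(actions: Sequence[Sequence[float]]) -> float:
    """Mean pairwise Euclidean distance between the actions that the
    ensemble members output for the same state (assumed diversity measure)."""
    pairs = list(combinations(actions, 2))
    if not pairs:
        return 0.0  # a single policy has no diversity
    return sum(math.dist(a, b) for a, b in pairs) / len(pairs)


def regularized_loss(task_loss: float,
                     actions: Sequence[Sequence[float]],
                     kappa: float) -> float:
    """Task loss minus the weighted diversity term, so that minimizing
    the loss rewards ensemble members for disagreeing on the action."""
    return task_loss - kappa * action_diversity(actions)
```

In this sketch a larger `kappa` trades task performance for diversity, which matches the qualitative trend the paper reports for its own regularization weight $\kappa$.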

Paper Structure (25 sections, 5 equations, 7 figures, 2 tables)

Introduction
Related Work
Preliminaries
Problem Definition
Reinforcement Learning of Point Navigation
State Space
Action Space
Reward
Training Environment
Our Approach
Ensemble Generation
Querying
Baseline Querying Approach
Reward Model
Policy Alignment
...and 10 more sections

Figures (7)

Figure 1: Our ensemble of RL policies generates a variety of trajectories for a given navigation task as queries for RL from human feedback. In contrast, deterministic policies are limited to just one trajectory, and the queries' variety depends on trajectory segments from randomized scene configurations. As a result, EnQuery facilitates a higher preference information gain for low query numbers.
Figure 2: a) Diversity of actions over the ensemble and b) success rate on the navigation task as a function of the total training time steps $T$ and the weighting factor $\kappa$ of the regularization term. The action diversity grows with $\kappa$, while the success rate decreases rapidly for $\kappa > 0.07$.
Figure 3: Trajectories of the ensemble policies $\pi_i$ for a given obstacle configuration and randomized start position. Each plot a) and b) shows three individual start positions. A distinct diversity of the trajectory pathways can be observed.
Figure 4: Reward model test accuracy for our EnQ approach and the baseline of segment-based uniform sampling (Christiano et al., 2017) over different query numbers, on their native dataset (e.g., EnQ on $\mathcal{D}_\textit{ens}$) and in cross-validation (e.g., EnQ on $\mathcal{D}_\textit{seg}$). The process of querying, reward model training, and testing was repeated ten times; the mean and standard deviation are shown. EnQ outperforms the baseline with a higher test accuracy, and thus a higher information gain, for low query numbers, enabling faster learning in time-critical scenarios.
Figure 5: Driving behavior for a given scene visualized by our novel explainability navigation plot (see Sec. \ref{sec:streamplot}) for a) the raw policy $\pi_\text{raw}$, b) the preference-aligned policy EnQ for $N_Q = 15$ queries, and c) for $N_Q = 60$ queries. The trajectory flow can be derived from any start position in the given scene to the goal (blue star), while circumnavigating the human (red dot). Regions of interest (ROI) are indicated in orange. Under the raw policy, mostly goal-directed and collision-avoiding navigation behavior can be observed. For the aligned policies, a pronounced shift away from the human appears at the cost of longer trajectories, e.g., on the far side of the top right obstacle (ROI 2). At the same time, the area around the human is traversed less often (ROI 1), as indicated by the underlying traversal map. EnQ-60 traverses closer to the human in the direct vicinity (ROI 3).
...and 2 more figures
