Demonstration-Enhanced Adaptable Multi-Objective Robot Navigation

Jorge de Heuvel; Tharun Sethuraman; Maren Bennewitz

TL;DR

The paper tackles the challenge of aligning robot navigation with evolving human preferences by integrating demonstration-driven learning into a multi-objective reinforcement learning framework. It proposes a TD3-based PD-MORL policy augmented with a D-REX–derived reward model that captures the demonstrated behavior, and it enables on-the-fly adaptation without retraining by feeding a preference vector $\boldsymbol{\lambda}$ to the policy as input. The approach uses a four-objective reward vector that balances core navigation (goal progress and collision avoidance) with tuneable objectives for demonstration-like behavior, proxemics, and efficiency. It demonstrates strong preference reflection, robustness, and sim-to-real transfer on two robots. Real-world experiments corroborate the method’s practical viability, showing safe, adaptable navigation across static and dynamic human scenarios. Overall, the framework offers a principled, post-training mechanism to personalize robot navigation in human-centric environments with real-time control over objective trade-offs.
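To make the role of $\boldsymbol{\lambda}$ concrete, the following minimal sketch shows one common way a preference vector can modulate a reward vector and condition a policy. It assumes a linear scalarization, which is a frequent choice in MORL but is not confirmed by this summary. The objective labels, the function names `scalarize` and `policy_input`, and all numeric values are hypothetical and chosen only for illustration.

```python
from typing import List, Sequence

# Hypothetical labels for the four objectives named in the TL;DR.
OBJECTIVES = ("core", "demonstration", "distance", "efficiency")


def scalarize(reward_vector: Sequence[float], preference: Sequence[float]) -> float:
    """Weight each objective's reward by its preference entry and sum.

    Assumption: linear scalarization; the paper's exact modulation may differ.
    """
    if len(reward_vector) != len(preference):
        raise ValueError("reward vector and preference must have equal length")
    return sum(lam * r for lam, r in zip(preference, reward_vector))


def policy_input(state: Sequence[float], preference: Sequence[float]) -> List[float]:
    """Append the preference to the state so one policy can serve many preferences."""
    return list(state) + list(preference)


# Illustrative values only: full weight on the core and distance objectives.
r_t = [1.0, 0.5, -0.2, 0.1]
lam = [1.0, 0.0, 1.0, 0.0]
scalar_reward = scalarize(r_t, lam)
observation = policy_input([0.1, 0.2], lam)
```

Because the preference is part of the policy input, changing `lam` at run time changes the behavior without touching the network weights, which is the post-training adaptation the TL;DR describes.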

Abstract

Preference-aligned robot navigation in human environments is typically achieved through learning-based approaches, utilizing user feedback or demonstrations for personalization. However, personal preferences are subject to change and might even be context-dependent. Yet traditional reinforcement learning (RL) approaches with static reward functions often fall short in adapting to evolving user preferences, inevitably reflecting demonstrations once training is completed. This paper introduces a structured framework that combines demonstration-based learning with multi-objective reinforcement learning (MORL). To ensure real-world applicability, our approach allows for dynamic adaptation of the robot navigation policy to changing user preferences without retraining. It fluently modulates the amount of demonstration data reflection and other preference-related objectives. Through rigorous evaluations, including a baseline comparison and sim-to-real transfer on two robots, we demonstrate our framework's capability to adapt to user preferences accurately while achieving high navigational performance in terms of collision avoidance and goal pursuance.

Paper Structure (22 sections, 3 equations, 5 figures, 1 table)

Introduction
Related Work
Our Approach
Problem Statement
Multi-Objective Reinforcement Learning
State and Action Space
Networks
Incorporating Demonstrations
Reward Vector
Navigational Core Objectives
Tuneable Preference Objectives
Demonstration Acquisition and Reward
Experimental Evaluation
Training and Environment
Qualitative Navigation Analysis
...and 7 more sections

Figures (5)

Figure 1: Our framework integrates demonstration-based learning into multi-objective reinforcement learning, enabling robots to adapt navigation policies to users' changing preferences even after training. a) The navigation style can fluently shift between demonstration-induced, distance keeping, and efficiency objectives. b) We modulate the MORL reward vector $\boldsymbol{r}_t$ with a c) varying preference $\boldsymbol{\lambda}$, while providing $\boldsymbol{\lambda}$ as input to the agent. d) The resulting human-centered policy can generate a spectrum of trajectories, here sketched for the objectives of demonstration-reflection (red, here: wall-following) and path efficiency (yellow).
Figure 2: Exploration of D-REX-related demonstration parameters averaged over 20 trajectory rollouts, measured against the optimal demonstration behavior's reward. a) The execution of the $\epsilon$-greedy noise-injected behavior cloning (BC) policy trained with a demonstration augmentation factor of $N_D = 1{,}000$ reveals a degradation of navigation performance, measured by the normalized core reward $r_\text{core}$, with growing strength of the injected noise. b) The demonstration augmentation factor $N_D$ indicates how many times the optimal human-centric demonstration trajectory (see Sec. "Demonstration Acquisition and Reward") was rolled out with randomized obstacle placement to form the training dataset, showing increased performance with higher $N_D$.
Figure 3: Trajectory rollouts in simulation for different preference vectors (rows) and different scenes with a static and a dynamic approaching human (columns). As can be seen, the navigation policy shifts its behavior according to the set preference. The colorbars on the right indicate the interpolated preference space $\Lambda_i$ for each plot row. Static scenarios such as (A+B) were covered during training, while a moving human (C+D) and the corridor environment (E) test for generalization. While shifting Row 1) from shortest driving behavior under the maximum efficiency preference (yellow) to distance-keeping (blue), the minimum distance from the human increases. At the same time, a tendency to navigate alongside obstacles - if present close to the path - has developed. Shifting towards the maximum demonstration preference (Row 2), the trajectory shapes increasingly resemble the demonstration pattern (black). On the shift back to maximum efficiency (Row 3), the demonstration pattern disappears in favor of shortest trajectories. Comparing the static (A+B) vs. moving human (C+D), the demonstration preference reflection becomes less distinct as the agent struggles to follow the static pattern that moves with the now dynamic human, yet efficiency and distance preferences keep up with a moving human. In the corridor intersection scene (E), not included during training of the policy, the agent successfully accounts for the wall, reducing the possible distance-keeping to the human. The varied angle between human and goal from the robot's perspective does not prevent the policy from first approaching the human under the maximum demonstration preference, before continuing towards the goal.
Figure 4: Quantitative metrics of OUR agent for different preference configurations (e), tested for statistical significance for dissimilar means between the maximum preferences, with *** for $p<.001$, and ns for not significant. a) The navigation time is smallest for maximized efficiency preference, as expected. b) The Fréchet distance to the demonstration trajectory decreases as the demonstration preference increases. c) The minimum distance to any obstacle is measured using the lidar. d) The minimum distance from the human grows with its preference weight. The preference-independent non-MORL policy CORE (red dotted line) that only obeys the navigational core reward term $r_\text{core}$ of collision avoidance and goal pursuance is included in each plot.
Figure 5: Real-world experiment setup (top) and results (bottom) with the policy OUR in a sim-to-real transfer with the Kobuki TurtleBot 2 (left) and the Toyota HSR (right). With a static human as during training (A+B), the navigation behavior in the real world successfully reflects the varying preferences on both robots. While the TurtleBot exhibits better demonstration reflection, the HSR keeps more distance from the human under the maximum distance preference. With a dynamic approaching human (C+D) that was not accounted for during training, the preference reflection decreases.
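Figure 4b reports the Fréchet distance between a driven trajectory and the demonstration trajectory. As a rough illustration of that metric, the sketch below computes the discrete Fréchet distance between two polylines with the standard dynamic-programming recurrence. Whether the paper uses the discrete or the continuous variant is not stated in the captions, so the discrete form is an assumption, and the function name `discrete_frechet` and the sample trajectories are hypothetical.

```python
import math
from typing import List, Sequence

Point = Sequence[float]


def discrete_frechet(p: Sequence[Point], q: Sequence[Point]) -> float:
    """Discrete Frechet distance between two polylines given as point lists.

    ca[i][j] holds the coupling distance for the prefixes p[:i+1] and q[:j+1].
    """
    if not p or not q:
        raise ValueError("both trajectories need at least one point")
    n, m = len(p), len(q)
    ca: List[List[float]] = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            d = math.dist(p[i], q[j])  # Euclidean distance between the two points
            if i == 0 and j == 0:
                ca[i][j] = d
            elif i == 0:
                ca[i][j] = max(ca[0][j - 1], d)
            elif j == 0:
                ca[i][j] = max(ca[i - 1][0], d)
            else:
                # Advance along p, along q, or along both, whichever is cheapest.
                best_prev = min(ca[i - 1][j], ca[i - 1][j - 1], ca[i][j - 1])
                ca[i][j] = max(best_prev, d)
    return ca[n - 1][m - 1]


# Illustrative trajectories: two parallel straight lines one unit apart.
demo = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
driven = [(0.0, 1.0), (1.0, 1.0), (2.0, 1.0)]
distance = discrete_frechet(demo, driven)
```

A smaller value means the driven path stays closer in shape to the demonstration, which matches the trend described for Figure 4b as the demonstration preference increases.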
