Learning Implicit Social Navigation Behavior using Deep Inverse Reinforcement Learning

Tribhi Kathuria, Ke Liu, Junwoo Jang, X. Jessie Yang, Maani Ghaffari

TL;DR

The paper tackles learning social navigation policies in dense, dynamic environments by learning a scene-geometry-conditioned reward map from demonstrations. It introduces Smooth MEDIRL (S-MEDIRL), which extrapolates beyond the expert demonstrations and applies a smoothing loss to produce robust, generalizable cost maps that guide a local controller. Experiments in a photo-realistic, narrow-crossing setting show that S-MEDIRL reduces deadlocks and improves success rates relative to the ORCA baseline and standard MEDIRL, highlighting the value of scene context and few-shot demonstrations. The work provides open-source data and code, enabling broader evaluation and adaptation to complex indoor navigation tasks.

Abstract

This paper reports on learning a reward map for social navigation in dynamic environments where the robot can reason about its path at any time, given agents' trajectories and scene geometry. Humans navigating in dense and dynamic indoor environments often work with several implied social rules. A rule-based approach fails to model all possible interactions between humans, robots, and scenes. We propose a novel Smooth Maximum Entropy Deep Inverse Reinforcement Learning (S-MEDIRL) algorithm that can extrapolate beyond expert demos to better encode scene navigability from few-shot demonstrations. The agent learns to predict the cost maps reasoning on trajectory data and scene geometry. The agent samples a trajectory that is then executed using a local crowd navigation controller. We present results in a photo-realistic simulation environment, with a robot and a human navigating a narrow crossing scenario. The robot implicitly learns to exhibit social behaviors such as yielding to oncoming traffic and avoiding deadlocks. We compare the proposed approach to the popular model-based crowd navigation algorithm ORCA and a rule-based agent that exhibits yielding.
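The "smooth" part of S-MEDIRL can be illustrated with a small sketch. The exact loss used in the paper is not given in this summary, so the function below is an assumption: a total-variation-style penalty (the hypothetical `smoothness_penalty`) that averages squared differences between neighboring cells of a reward map. It only shows the kind of term that discourages a pixel from taking a value starkly different from its neighbors.

```python
def smoothness_penalty(reward):
    """Mean squared difference between each cell and its right/down neighbor.

    Assumption: a generic total-variation-style penalty, not necessarily
    the smoothing loss used in the paper.
    """
    rows, cols = len(reward), len(reward[0])
    total, pairs = 0.0, 0
    for r in range(rows):
        for c in range(cols):
            if c + 1 < cols:  # horizontal neighbor
                total += (reward[r][c] - reward[r][c + 1]) ** 2
                pairs += 1
            if r + 1 < rows:  # vertical neighbor
                total += (reward[r][c] - reward[r + 1][c]) ** 2
                pairs += 1
    return total / pairs if pairs else 0.0


# Hypothetical 3x3 reward maps: one varies gently, one has a single spike.
smooth_map = [[0.0, 0.1, 0.2], [0.1, 0.2, 0.3], [0.2, 0.3, 0.4]]
spiky_map = [[0.0, 0.1, 0.2], [0.1, 5.0, 0.3], [0.2, 0.3, 0.4]]
```

Under this sketch the spiky map receives a larger penalty than the gently varying one, so adding such a term to the training objective would push the learned reward toward locally consistent values.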

Paper Structure (17 sections, 5 equations, 8 figures)

Figures (8)

  • Figure 1: Robots navigating in dense indoor environments with humans in the scene exhibit implicit social behaviors (legibility, yielding, and so on). We use expert demonstration data to teach the robot these implicit navigation behaviors. Figs. \ref{fig:image1}, \ref{fig:image4} depict the beginning of an episode, where the expert chooses a good position on the way to the goal while leaving room for the human. At the intermediate goal, the agent should likely wait for the human to pass and try to avoid them, as seen in Figs. \ref{fig:image2}, \ref{fig:image4}. Finally, once the human has cleared the path, the agent proceeds to its goal, depicted by the pink star in Figs. \ref{fig:image3}, \ref{fig:image6}.
  • Figure 2: Fig. \ref{fig:network-architecture} shows the training architecture. The expert demos are sampled every 0.1 s, and each sample creates a new demonstration data point. The past trajectories of the human and robot are fed in to give the network a history of past positions. The U-Net architecture feeds into the final regression layer, which outputs the learned reward. The reward is then used to sample a trajectory ($E_\mu$), which is compared to the demonstrated trajectory at any time $t$, looking 10 steps into the future. The resulting gradient is backpropagated until the network converges. Fig. \ref{fig: inference} shows the online deployment pipeline, where the reward inferred by the network is used to generate the reference trajectory for the local controller.
  • Figure 3: The noise-augmented data fed into the network for one of our trajectories. Fig. \ref{fig: no_noise} depicts the expert demo, while Fig. \ref{fig: noise} depicts the added noisy trajectory data.
  • Figure 4: Comparison of a reward at time $t$ trained with and without the smoothing loss. In Fig. \ref{fig:reward_0}, fewer pixels are assigned values starkly different from those of their neighbors than in Fig. \ref{fig:reward_1}.
  • Figure 5: A snapshot of the network's input features. The first three features, extracted from the RGB image of the demonstration's top-down view, are normalized and split into Red, Green, and Blue channels (Fig. \ref{fig: rgb_r}-\ref{fig: rgb_b}). The network also receives the episode history, including the robot's past trajectory (Fig. \ref{fig: robot_past}) and the human's past trajectory (Fig. \ref{fig: human_past}). The human's current state, represented by velocity and heading (Fig. \ref{fig: human_vel}-\ref{fig: human_heading}), and the robot's goal (Fig. \ref{fig: robot_goal}) are also encoded as inputs.
  • ...and 3 more figures
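
The training signal described in the Figure 2 caption can be sketched as follows. In maximum entropy deep IRL, the gradient of the demonstration log-likelihood with respect to the per-cell reward is the expert's state visitation minus the expected visitation under the current reward. The sketch below assumes a discretized grid, trajectories given as `(row, col)` cells, and a Monte Carlo estimate of the expected visitation from sampled trajectories; the function names and the grid representation are hypothetical, and the 10-step horizon mirrors the caption.

```python
def visitation_counts(trajectory, rows, cols, horizon=10):
    """Count grid-cell visits over the first `horizon` steps of a trajectory."""
    counts = [[0.0] * cols for _ in range(rows)]
    for r, c in trajectory[:horizon]:
        counts[r][c] += 1.0
    return counts


def medirl_reward_gradient(expert_traj, sampled_trajs, rows, cols, horizon=10):
    """Expert visitation minus the mean visitation of sampled trajectories.

    Assumption: expected visitation is estimated by averaging over samples
    drawn under the current reward; `sampled_trajs` must be non-empty.
    """
    mu_expert = visitation_counts(expert_traj, rows, cols, horizon)
    expected = [[0.0] * cols for _ in range(rows)]
    for traj in sampled_trajs:
        counts = visitation_counts(traj, rows, cols, horizon)
        for r in range(rows):
            for c in range(cols):
                expected[r][c] += counts[r][c] / len(sampled_trajs)
    return [[mu_expert[r][c] - expected[r][c] for c in range(cols)]
            for r in range(rows)]


# Hypothetical 3x3 example: the expert moves along the top row, while one of
# the two sampled trajectories moves down the left column instead.
expert = [(0, 0), (0, 1), (0, 2)]
samples = [[(0, 0), (1, 0), (2, 0)], [(0, 0), (0, 1), (0, 2)]]
grad = medirl_reward_gradient(expert, samples, rows=3, cols=3)
```

Cells the expert visits more often than the samples get a positive gradient (their reward should rise), and cells the samples over-visit get a negative one. In a deep IRL setup this per-cell gradient would be backpropagated through the reward network rather than applied to the map directly.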