Multi-Agent Inverse Reinforcement Learning in Real World Unstructured Pedestrian Crowds

Rohan Chandra, Haresh Karnan, Negar Mehr, Peter Stone, Joydeep Biswas

TL;DR

Problem: learn human intent for socially compliant robot navigation in real-world crowds. Approach: multi-agent MaxEnt IRL with a tractability-rationality trade-off, leveraging a constant-velocity dynamics approximation, Taylor-based quadratic cost surrogates, and Gaussian policies rolled out via a dynamic game solver; the key trick is a covariance modification $\widetilde{\Sigma}^i_t = \Sigma^i_t + g(\mathrm{diag}(\Sigma^i_t)) I$. Contributions: (i) a tractable MAIRL framework for unstructured crowds, (ii) a Hessian-based tractability technique enabling learning with limited data, and (iii) empirical gains on Speedway with competitive performance on ETH/UCY/JRDB/SCAND. Significance: demonstrates explicit reward-based reasoning for multi-agent social navigation with limited demonstrations and points to bounds on trade-offs and integration of social context for enhanced realism.
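The covariance modification named in the TL;DR can be illustrated with a small numerical sketch. Adding a multiple of the identity shifts every eigenvalue of a symmetric matrix by the same amount, so a large enough shift makes the matrix positive definite. The summary does not define $g$, so `example_g` below is a hypothetical placeholder chosen only for illustration, and the function names are likewise assumptions. This particular `example_g` happens to work for the toy matrix but does not guarantee positive definiteness in general.

```python
import numpy as np


def shift_covariance(sigma, g):
    """Return sigma + g(diag(sigma)) * I for a square matrix sigma."""
    sigma = np.asarray(sigma, dtype=float)
    shift = float(g(np.diag(sigma)))  # scalar computed from the diagonal only
    return sigma + shift * np.eye(sigma.shape[0])


def is_positive_definite(matrix):
    """Check positive definiteness of a symmetric matrix via its smallest eigenvalue."""
    return bool(np.linalg.eigvalsh(matrix).min() > 0.0)


def example_g(diagonal):
    # Hypothetical g (NOT taken from the paper): twice the largest absolute diagonal entry.
    return 2.0 * np.max(np.abs(diagonal))


# Toy symmetric matrix with eigenvalues 3 and -1, so it is not positive definite.
sigma = np.array([[1.0, 2.0], [2.0, 1.0]])

# The shift is 2.0, giving [[3, 2], [2, 3]] with eigenvalues 5 and 1.
sigma_tilde = shift_covariance(sigma, example_g)
```

Passing `g` as an argument keeps the sketch agnostic about the actual choice; any scalar-valued function of the diagonal can be substituted.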

Abstract

Social robot navigation in crowded public spaces such as university campuses, restaurants, grocery stores, and hospitals is an increasingly important area of research. One of the core strategies for achieving this goal is to understand humans' intent (the underlying psychological factors that govern their motion) by learning their reward functions, typically via inverse reinforcement learning (IRL). Despite significant progress in IRL, learning the reward functions of multiple agents simultaneously in dense, unstructured pedestrian crowds has remained intractable because of the tightly coupled social interactions that occur in these scenarios, e.g., passing, intersections, swerving, and weaving. In this paper, we present a new multi-agent maximum entropy inverse reinforcement learning algorithm for real-world unstructured pedestrian crowds. Key to our approach is a simple but effective mathematical trick, which we call the tractability-rationality trade-off trick, that achieves tractability at the cost of a slight reduction in accuracy. We compare our approach to classical single-agent MaxEnt IRL as well as state-of-the-art trajectory prediction methods on several datasets, including ETH, UCY, SCAND, JRDB, and a new dataset called Speedway, collected at a busy intersection on a university campus and focusing on dense, complex agent interactions. Our key findings show that, on the dense Speedway dataset, our approach ranks 1st among the top 7 baselines with a >2X improvement over single-agent IRL, and is competitive with state-of-the-art large transformer-based encoder-decoder models on sparser datasets such as ETH/UCY (ranks 3rd among the top 7 baselines).

Paper Structure

This paper contains 15 sections, 9 equations, 4 figures, 2 tables.

Figures (4)

  • Figure 1: Flow diagram of the multi-agent inverse reinforcement learning algorithm. The algorithm takes as input the system dynamics, initial state, dataset, feature function, and initial parameters. At each time step, it computes the expected features under the current policy. These features are used to update the weights by minimizing the difference between the expected and ground-truth features. The updated weights are then used to update the game-theoretic policy and roll out a new set of trajectories. If the Hessian matrix is not positive definite, the "Tractability-Rationality trade-off" trick is applied to ensure tractability in unstructured environments.
  • Figure 2: Visualizing and comparing the entropy of trajectories in the Speedway dataset ($1.061$ bits) with the entropy of trajectories in the Waymo and INTERACTION datasets ($0.336$ and $0.475$ bits, respectively). We observe that trajectories in the Speedway dataset are denser, more unstructured, and have a higher entropy.
  • Figure 3: Qualitative comparison. Each color represents a pedestrian. Solid faded lines represent demonstration trajectories and dashed lines represent trajectories generated from the learned policies. The directions of movement for the pedestrians are north to south ($\downarrow$), east to west ($\leftarrow$, blue), and west to east ($\rightarrow$, red). 'GT' and 'Pred' refer to ground truth and predicted trajectories. We inspect how closely the predicted trajectories align with the ground truth distribution.
  • Figure 4: The cumulative RMSE distribution shows the percentage of trajectories below a given RMSE threshold; steeper curves indicate a more effective learner.
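
The cumulative RMSE distribution described for Figure 4 can be sketched as follows. This is a minimal illustration, not the authors' evaluation code: it assumes each trajectory is an array of 2D positions of shape (T, 2), that the per-trajectory RMSE is taken over Euclidean position errors, and that "below" means strictly less than the threshold. The function names and toy data are hypothetical.

```python
import numpy as np


def trajectory_rmse(pred, gt):
    """RMSE of one trajectory; pred and gt are arrays of shape (T, 2)."""
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    sq_dist = np.sum((pred - gt) ** 2, axis=-1)  # squared Euclidean error per time step
    return float(np.sqrt(np.mean(sq_dist)))


def cumulative_rmse_distribution(rmses, thresholds):
    """Percentage of trajectories whose RMSE is strictly below each threshold."""
    rmses = np.asarray(rmses, dtype=float)
    return [100.0 * float(np.mean(rmses < t)) for t in thresholds]


# Toy data (made up for illustration): one ground-truth path and three
# predictions offset by constant vectors of length 0.5, 1.0, and 2.0.
gt = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]])
offsets = [np.array([0.3, 0.4]), np.array([0.6, 0.8]), np.array([1.2, 1.6])]
rmses = [trajectory_rmse(gt + off, gt) for off in offsets]

thresholds = [0.75, 1.5, 2.5]
percentages = cumulative_rmse_distribution(rmses, thresholds)
```

Plotting `percentages` against `thresholds` for each method gives one curve per learner; a curve that rises faster means more trajectories fall under small error thresholds.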