HEIGHT: Heterogeneous Interaction Graph Transformer for Robot Navigation in Crowded and Constrained Environments

Shuijing Liu, Haochen Xia, Fatemeh Cheraghi Pouria, Kaiwen Hong, Neeloy Chakraborty, Zichao Hu, Joydeep Biswas, Katherine Driggs-Campbell

TL;DR

This work tackles robot navigation in crowded and constrained indoor environments by introducing HEIGHT, a structured policy built on a heterogeneous spatio-temporal graph. By splitting scene inputs into human dynamics and obstacle geometry and applying separate attention mechanisms for robot-human and human-human interactions, HEIGHT achieves robust long-horizon reasoning and adaptive collision avoidance. Extensive simulations and real-world deployments demonstrate superior performance over baselines in success, time efficiency, and generalization to distribution shifts, with notable sim2real transfer advantages. The approach highlights the value of explicit scene structure and edge-type specialization for multi-agent navigation in complex environments.

Abstract

We study the problem of robot navigation in dense and interactive crowds with static constraints such as corridors and furniture. Previous methods fail to consider all types of spatial and temporal interactions among agents and obstacles, leading to unsafe and inefficient robot paths. In this article, we leverage a graph-based representation of crowded and constrained scenarios and propose a structured framework to learn robot navigation policies with deep reinforcement learning. We first split the representations of different inputs and propose a heterogeneous spatio-temporal graph to model distinct interactions among humans, robots, and obstacles. Based on the heterogeneous spatio-temporal graph, we propose HEIGHT, a novel navigation policy network architecture with different components to capture heterogeneous interactions through space and time. HEIGHT utilizes attention mechanisms to prioritize important interactions and a recurrent network to track changes in the dynamic scene over time, encouraging the robot to avoid collisions adaptively. Through extensive simulation and real-world experiments, we demonstrate that HEIGHT outperforms state-of-the-art baselines in terms of success, navigation time, and generalization to domain shifts in challenging navigation scenarios. More information is available at https://sites.google.com/view/crowdnav-height/home.

Paper Structure

This paper contains 46 sections, 10 equations, 13 figures, 5 tables.

Figures (13)

  • Figure 1: A heterogeneous graph aids the robot's spatio-temporal reasoning in a crowded and constrained environment. The colored arrows denote robot-human (RH), human-human (HH), and obstacle-agent (OA) interactions. The opaque arrows are the more important interactions, while the transparent arrows are the less important ones. The white arrow indicates the front direction of the robot, and its length indicates the robot's speed. At each timestep $t$, the robot reasons about these interactions, focuses on the important ones, and makes decisions.
  • Figure 2: An overview of our pipeline in simulation and in the real world. (a) At each timestep $t$ in training and testing, the simulator provides a reward $r^t$ and the following observations of the environment: the obstacle point cloud $o^t$, the robot state $w^t$, the human states $h_1^t, \ldots, h_n^t$, and the masks $M^t$ (Sec. \ref{sec:method_attn}). These observations serve as inputs to HEIGHT, which outputs a robot action $a^t$ that maximizes the future expected return $R^t$. The simulator executes the actions of all agents and the loop continues. (b) The real-world testing loop is similar to the simulated one, except that the perception modules used to obtain the observations differ and the reward is absent.
  • Figure 3: A split representation of a crowded and constrained navigation scenario. In (b), individual human states are represented by circles. To represent static obstacles, we preprocess and smooth the original map (a). From this processed map and the robot's localization, we generate an artificial point cloud (c) that simulates obstacle perception.
  • Figure 4: The heterogeneous st-graph and the HEIGHT network architecture. (a) Graph representation of crowd navigation. The robot node is $w$ (pink), the $i$-th human node is $h_i$ (white), and the obstacle node is $o$ (yellow). HH edges and HH functions are in blue, OA edges and OA functions are in orange, and RH edges and RH functions are in red. The temporal function is in purple. (b) HEIGHT network. Two attention mechanisms are used to model the HH and RH interactions. We use MLPs and a concatenation for obstacle-agent interactions, and a GRU for the temporal function. The superscript $t$, which indicates the timestep, and the human mask $M$ are omitted for clarity.
  • Figure 5: Comparison of different methods in the same testing episode in the More Constrained environment. The robot is centered in white circles and its orientation is denoted by white arrows. More qualitative results can be found in the video attachment.
  • ...and 8 more figures
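
As a rough illustration of the masked attention mentioned in the captions of Figures 2 and 4, the following minimal Python sketch computes attention weights for one robot query over a padded set of human embeddings, with a mask marking which slots hold detected humans. It uses generic scaled dot-product attention, which may differ from the formulation in the paper. The function name `masked_attention`, the embedding sizes, and the toy numbers are assumptions made for illustration only.

```python
import math


def masked_attention(query, keys, values, mask):
    """Generic scaled dot-product attention over padded human slots.

    query:  list of floats, a hypothetical robot embedding.
    keys:   list of lists, one key vector per human slot.
    values: list of lists, one value vector per human slot.
    mask:   list of bools, True where the slot holds a detected human.
    Returns (weights, context).
    """
    dim = len(values[0]) if values else 0
    if not any(mask):
        # No human is visible: return zero weights and a zero context.
        return [0.0] * len(keys), [0.0] * dim

    scale = math.sqrt(len(query))
    scores = []
    for key, visible in zip(keys, mask):
        if visible:
            scores.append(sum(q * k for q, k in zip(query, key)) / scale)
        else:
            # Masked slots get -inf so their softmax weight is exactly 0.
            scores.append(-math.inf)

    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]

    # Weighted sum of the value vectors.
    context = [
        sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)
    ]
    return weights, context


# Toy example: three human slots, the last one is padding (assumed data).
robot_query = [1.0, 0.0]
human_keys = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
human_values = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
human_mask = [True, True, False]

weights, context = masked_attention(
    robot_query, human_keys, human_values, human_mask
)
print(weights)
print(context)
```

In this sketch the padded slot receives zero weight regardless of its key, so the context vector depends only on the two visible humans. A recurrent unit such as a GRU could then consume the context at each timestep, as the Figure 4 caption describes for the temporal function.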