CrowdSurfer: Sampling Optimization Augmented with Vector-Quantized Variational AutoEncoder for Dense Crowd Navigation

Naman Kumar, Antareep Singha, Laksh Nanwani, Dhruv Potdar, Tarun R, Fatemeh Rastgar, Simon Idoko, Arun Kumar Singh, K. Madhava Krishna

TL;DR

CrowdSurfer tackles dense-crowd navigation by decoupling trajectory generation from real-time optimization. It learns a discrete latent prior over expert trajectories with a VQ-VAE, samples diverse priors via a perception-conditioned PixelCNN, and refines them at inference time using the PRIEST planner to satisfy kinematic and collision constraints, achieving real-time performance (~$20\,\mathrm{Hz}$) for a 5-second horizon. The approach improves over the state-of-the-art DRL-VO in success rate and travel time across multiple environments, including unseen maps and changing layouts, and remains robust when no feasible global plan is guaranteed. Real-world experiments with Turtlebot2, Husky A200, and a custom wheelchair illustrate practical deployment challenges and the potential for broader adoption, while ablations highlight the benefits of environment-conditioned priors. Overall, CrowdSurfer offers a middle-ground framework that leverages expert demonstrations to inform long-horizon local planning with strong robustness to dynamics and map changes.

Abstract

Navigation amongst densely packed crowds remains a challenge for mobile robots. The complexity increases further if the environment layout changes, making the prior computed global plan infeasible. In this paper, we show that it is possible to dramatically enhance crowd navigation by just improving the local planner. Our approach combines generative modelling with inference time optimization to generate sophisticated long-horizon local plans at interactive rates. More specifically, we train a Vector Quantized Variational AutoEncoder to learn a prior over the expert trajectory distribution conditioned on the perception input. At run-time, this is used as an initialization for a sampling-based optimizer for further refinement. Our approach does not require any sophisticated prediction of dynamic obstacles and yet provides state-of-the-art performance. In particular, we compare against the recent DRL-VO approach and show a 40% improvement in success rate and a 6% improvement in travel time.
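The idea of using prior samples as the initialization of a sampling-based optimizer can be illustrated with a toy cross-entropy loop. This is only a minimal sketch under stated assumptions: the function names (`refine_with_cem`, `toy_cost`), the 2D parameter vector, the toy cost, and the elite fraction are hypothetical, and the projection step that PRIEST adds on top of sampling is omitted entirely.

```python
import math
import random
import statistics


def refine_with_cem(prior_samples, cost, iterations=2, elite_frac=0.2, seed=0):
    """Cross-entropy refinement warm-started from prior samples.

    prior_samples: list of parameter vectors (lists of floats).
    cost: callable mapping a parameter vector to a scalar (lower is better).
    Hypothetical helper for illustration; not the PRIEST planner.
    """
    rng = random.Random(seed)
    samples = [list(s) for s in prior_samples]
    n_elite = max(2, int(elite_frac * len(samples)))
    for _ in range(iterations):
        # Keep the lowest-cost samples as the elite set.
        elites = sorted(samples, key=cost)[:n_elite]
        # Fit an independent Gaussian per dimension to the elites.
        columns = list(zip(*elites))
        mean = [statistics.fmean(c) for c in columns]
        std = [statistics.pstdev(c) + 1e-6 for c in columns]
        # Retain the elites and redraw the remaining samples.
        resampled = [[rng.gauss(m, s) for m, s in zip(mean, std)]
                     for _ in range(len(samples) - n_elite)]
        samples = elites + resampled
    return min(samples, key=cost)


def toy_cost(point, goal=(5.0, 0.0), obstacle=(2.5, 0.0), radius=1.0):
    """Assumed toy cost: distance to goal plus a penalty inside an obstacle disc."""
    d_goal = math.dist(point, goal)
    d_obs = math.dist(point, obstacle)
    return d_goal + (10.0 if d_obs < radius else 0.0)


# Stand-in for samples drawn from a learned prior (here: uniform noise).
prior_rng = random.Random(1)
prior = [[prior_rng.uniform(0.0, 5.0), prior_rng.uniform(-2.0, 2.0)]
         for _ in range(50)]
best = refine_with_cem(prior, toy_cost)
```

Because the elites are carried over between iterations, the returned sample is never worse than the best prior sample under the given cost. A better prior therefore directly translates into a better starting point for the refinement.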

Paper Structure

This paper contains 22 sections, 6 equations, 6 figures, 3 tables.

Figures (6)

  • Figure 1: The proposed method, CrowdSurfer, first generates a diverse set of samples using the VQ-VAE + PixelCNN pipeline, then applies inference-time optimization via the PRIEST planner [rastgar2024priest], yielding trajectories that successfully traverse complex dynamic navigation scenarios. From left to right: the Husky A200, our autonomous wheelchair, and a simulated Turtlebot navigating crowded environments using CrowdSurfer.
  • Figure 2: The VQ-VAE pipeline is used to learn the discrete latent space of optimal demonstration trajectories. We train the VQ-VAE to predict polynomial coefficients that are converted to trajectories through the parametrization in \ref{parametrized}. This ensures that the reconstructed trajectories have higher continuity and differentiability.
  • Figure 3: It is possible to express the learned discrete latent space $\mathbf{Z}_q$ as a vector $\mathbf{h}_{q}$ whose elements signify which code-book vector is used to form the $i^{th}$ row of $\mathbf{Z}_q$. Our PixelCNN model outputs a multinomial probability distribution over $\mathbf{h}_{q}$. The number of discrete probabilities $p_k$ is equal to the number of code-book vectors. Thus, the PixelCNN model decides the probability that the $i^{th}$ element of $\mathbf{h}_q$ is formed by the $r^{th}$ code-book vector. Sampling from the multinomial distribution allows the generation of different $\mathbf{h}_q$, each leading to a distinct trajectory.
  • Figure 4: At any given planning step, the VQ-VAE + PixelCNN combination takes in the perception input and generates samples to initialize the PRIEST planner [rastgar2024priest]. The planner subsequently refines the predicted distribution using a cross-entropy method [wen2018constrained] augmented with a projection optimizer.
  • Figure 5: Trajectories at three stages of the pipeline. VQ-VAE + PixelCNN generate diverse primitives (GREY) that may not satisfy constraints or reach the goal. These trajectories are then optimized via two PRIEST [rastgar2024priest] iterations (GREEN) to satisfy constraints, and the best trajectory as scored by PRIEST is chosen (RED).
  • ...and 1 more figure
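
The sampling described in the captions of Figures 2 and 3 can be sketched in a few lines. This is an illustrative simplification rather than the authors' implementation: the per-row probabilities are passed in as a fixed table, whereas a real PixelCNN is autoregressive and conditions each row on the previously sampled ones; the decoder that maps $\mathbf{Z}_q$ to polynomial coefficients is omitted; and the monomial basis in `poly_trajectory` is an assumption, since the actual parametrization is not given here. All function names are hypothetical.

```python
import random


def sample_index_vector(probs, rng):
    """Draw h_q: one code-book index per row of Z_q.

    probs[i][k] is the probability that row i uses code-book vector k.
    Assumption: rows are sampled independently (no autoregression).
    """
    return [rng.choices(range(len(row)), weights=row, k=1)[0] for row in probs]


def lookup_latent(h_q, codebook):
    """Build Z_q by stacking the code-book vectors selected by h_q."""
    return [codebook[k] for k in h_q]


def poly_trajectory(coeffs, times):
    """Evaluate sum_j coeffs[j] * t**j at each time (Horner's rule).

    Assumption: a monomial basis for one coordinate of the trajectory.
    """
    values = []
    for t in times:
        acc = 0.0
        for c in reversed(coeffs):
            acc = acc * t + c
        values.append(acc)
    return values


# Toy example: 3 code-book vectors, 2 rows in Z_q.
codebook = [[0.0, 0.0], [1.0, -1.0], [0.5, 2.0]]
probs = [[0.1, 0.7, 0.2], [0.3, 0.3, 0.4]]
h_q = sample_index_vector(probs, random.Random(0))
Z_q = lookup_latent(h_q, codebook)
```

Calling `sample_index_vector` repeatedly with different random states yields different index vectors, and hence different latent matrices, which is what gives the pipeline its diverse set of candidate trajectories.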