Predicting Long-Term Human Behaviors in Discrete Representations via Physics-Guided Diffusion
Zhitian Zhang, Anjian Li, Angelica Lim, Mo Chen
TL;DR
Addresses the problem of long-horizon human trajectory forecasting by modeling the future discrete action sequence as $p(A|S)$ conditioned on past states, with $S=\big\{X,C\big\}$, and decoding to continuous trajectories via a Hierarchical Action Quantization (HAQ) VQ-VAE. The method combines a two-level HAQ to discretize trajectories, a denoising diffusion probabilistic model (DDPM) to sample $A$ in the discrete space using analog bits, and reachability guidance derived from Hamilton-Jacobi reachability to bias sampling toward physically feasible sequences. The three main contributions are: (1) an HAQ encoder/decoder for context- and trajectory-conditioned discrete actions, (2) a DDPM-based diffusion model operating in the discrete latent space to approximate $p(A|S)$, and (3) reachability-guided sampling that imposes physical constraints without retraining classifiers. Evaluations on SFU-Store-Nav (SSN) and JRDB show superior long-term ADE/FDE performance and competitive multimodality, indicating that the approach can support safe, multimodal long-horizon planning for robotics and autonomous systems.
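To make the analog-bits idea concrete, the following is a minimal sketch of the general technique, not the authors' implementation: discrete code indices are written as binary vectors, mapped to real values in {-1, +1} so that a continuous diffusion model can operate on them, and recovered by thresholding at zero. The function names, the 4-bit width, and the example indices are assumptions chosen for illustration.

```python
import numpy as np

def int_to_analog_bits(indices, num_bits):
    """Map integer code indices to analog bits in {-1.0, +1.0} (MSB first)."""
    indices = np.asarray(indices, dtype=np.int64)
    shifts = np.arange(num_bits - 1, -1, -1)
    bits = (indices[..., None] >> shifts) & 1       # shape (..., num_bits), values 0/1
    return bits.astype(np.float32) * 2.0 - 1.0      # 0 -> -1.0, 1 -> +1.0

def analog_bits_to_int(analog):
    """Threshold real-valued analog bits at 0 and convert back to integers."""
    bits = (np.asarray(analog) > 0).astype(np.int64)
    weights = 2 ** np.arange(bits.shape[-1] - 1, -1, -1)
    return (bits * weights).sum(axis=-1)

# Hypothetical latent action indices from a 16-entry codebook (4 bits each).
codes = np.array([0, 5, 12, 15])
x0 = int_to_analog_bits(codes, num_bits=4)

# A deterministic perturbation stands in for residual noise after denoising;
# signs are preserved, so thresholding recovers the original indices.
perturbed = 0.6 * x0 + 0.1
recovered = analog_bits_to_int(perturbed)
```

Because decoding only depends on the sign of each entry, the sampler's output does not need to land exactly on -1 or +1 for the discrete action to be recovered.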
Abstract
Long-term human trajectory prediction is a challenging yet critical task in robotics and autonomous systems. Prior work that predicts accurate short-term human trajectories from unimodal features alone often fails at long-term prediction. Reinforcement learning offers one way to learn long-term human behaviors, but it can suffer from poor data efficiency and difficult optimization. In this work, we propose a long-term human trajectory forecasting framework that leverages a guided diffusion model to generate diverse long-term human behaviors in a high-level latent action space. This space is obtained via a hierarchical action quantization scheme that uses a VQ-VAE to discretize continuous trajectories and the available context. The latent actions are predicted by our guided diffusion model, which uses physics-inspired guidance at test time to constrain the generated multimodal action distributions. Specifically, we use reachability analysis during the reverse denoising process to guide the diffusion steps toward physically feasible latent actions. We evaluate our framework on two publicly available human trajectory forecasting datasets, SFU-Store-Nav and JRDB. Extensive experimental results show that our framework achieves superior performance in long-term human trajectory forecasting.
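The quantization step at the core of a VQ-VAE can be sketched as a nearest-neighbor lookup into a learned codebook. The snippet below illustrates only that generic lookup, under assumed shapes and toy values; it does not reproduce the hierarchical two-level scheme or the encoder and decoder networks described in the paper, and the name `quantize` is hypothetical.

```python
import numpy as np

def quantize(latents, codebook):
    """Assign each latent vector to its nearest codebook entry (squared L2).

    latents:  array of shape (N, D), e.g. encoder outputs for N trajectory segments
    codebook: array of shape (K, D), K discrete latent actions
    Returns the chosen indices, shape (N,), and the quantized vectors, shape (N, D).
    """
    diffs = latents[:, None, :] - codebook[None, :, :]   # (N, K, D)
    sq_dist = (diffs ** 2).sum(axis=-1)                  # (N, K)
    indices = sq_dist.argmin(axis=1)
    return indices, codebook[indices]

# Toy codebook with three 2-D entries and three encoder outputs (assumed values).
codebook = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
latents = np.array([[0.9, 0.1], [0.1, 0.8], [-0.2, 0.1]])
indices, quantized = quantize(latents, codebook)
```

The resulting indices are the discrete symbols a diffusion model in the latent action space would be trained to generate, and the quantized vectors are what a decoder would map back to continuous trajectories.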
