Predicting Long-Term Human Behaviors in Discrete Representations via Physics-Guided Diffusion

Zhitian Zhang, Anjian Li, Angelica Lim, Mo Chen

TL;DR

The paper addresses long-horizon human trajectory forecasting by modeling the future discrete action sequence as $p(A|S)$, conditioned on past states $S=\{X,C\}$, and decoding it to continuous trajectories with a Hierarchical Action Quantization (HAQ) VQ-VAE. The method combines three components: a two-level HAQ that discretizes trajectories, a denoising diffusion probabilistic model (DDPM) that samples $A$ in the discrete space using analog bits, and guidance derived from Hamilton-Jacobi reachability that biases sampling toward physically feasible sequences. The three main contributions are: (1) an HAQ encoder/decoder for context- and trajectory-conditioned discrete actions, (2) a DDPM-based diffusion model operating in the discrete latent space to approximate $p(A|S)$, and (3) reachability-guided sampling that imposes physical constraints without retraining classifiers. Evaluations on SFU-Store-Nav (SSN) and JRDB show superior long-term ADE/FDE performance and competitive multimodality, suggesting the approach can support safe, multimodal long-horizon planning for robotics and autonomous systems.
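To make the analog-bits idea concrete, here is a minimal sketch of how integer action indices can be mapped to real-valued bits that a continuous diffusion model can denoise, and how thresholding recovers the indices afterwards. The function names, the 3-bit width, and the bounded perturbation standing in for residual denoising error are assumptions for illustration, not details taken from the paper.

```python
import numpy as np

def int_to_analog_bits(indices, num_bits):
    """Map integer codebook indices to analog bits in {-1.0, +1.0}."""
    indices = np.asarray(indices, dtype=np.int64)
    shifts = np.arange(num_bits - 1, -1, -1)      # most significant bit first
    bits = (indices[..., None] >> shifts) & 1     # shape (..., num_bits), values 0/1
    return bits.astype(np.float32) * 2.0 - 1.0    # 0 -> -1.0, 1 -> +1.0

def analog_bits_to_int(analog_bits):
    """Threshold real-valued bits at 0 and reassemble the integer indices."""
    bits = (np.asarray(analog_bits) > 0).astype(np.int64)
    num_bits = bits.shape[-1]
    weights = 1 << np.arange(num_bits - 1, -1, -1)
    return (bits * weights).sum(axis=-1)

# Hypothetical sequence of 6 discrete actions drawn from a codebook of size 8.
actions = np.array([3, 0, 7, 5, 2, 6])
x0 = int_to_analog_bits(actions, num_bits=3)

# Stand-in for an imperfect denoised sample: bounded noise that keeps each sign.
rng = np.random.default_rng(0)
noisy = x0 + rng.uniform(-0.5, 0.5, size=x0.shape)
recovered = analog_bits_to_int(noisy)
```

The diffusion model itself would operate on arrays shaped like `x0`; only the final thresholding step returns to the discrete action space.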

Abstract

Long-term human trajectory prediction is a challenging yet critical task in robotics and autonomous systems. Prior work that studied how to predict accurate short-term human trajectories with only unimodal features often failed in long-term prediction. Reinforcement learning provides a good solution for learning human long-term behaviors but can suffer from challenges in data efficiency and optimization. In this work, we propose a long-term human trajectory forecasting framework that leverages a guided diffusion model to generate diverse long-term human behaviors in a high-level latent action space, obtained via a hierarchical action quantization scheme using a VQ-VAE to discretize continuous trajectories and the available context. The latent actions are predicted by our guided diffusion model, which uses physics-inspired guidance at test time to constrain generated multimodal action distributions. Specifically, we use reachability analysis during the reverse denoising process to guide the diffusion steps toward physically feasible latent actions. We evaluate our framework on two publicly available human trajectory forecasting datasets: SFU-Store-Nav and JRDB, and extensive experimental results show that our framework achieves superior performance in long-term human trajectory forecasting.
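The discretization step mentioned in the abstract relies on vector quantization. As a rough illustration, the sketch below shows a generic single-level nearest-codebook lookup, the core operation of a VQ-VAE. It is not the paper's hierarchical scheme; the codebook values, latent dimensions, and function name are made up for the example.

```python
import numpy as np

def quantize(latents, codebook):
    """Assign each latent vector to its nearest codebook entry (squared L2)."""
    # latents: (N, D), codebook: (K, D) -> pairwise squared distances of shape (N, K)
    d2 = ((latents[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    indices = d2.argmin(axis=1)           # discrete token per latent
    return indices, codebook[indices]     # tokens and their quantized vectors

# Hypothetical 2-D codebook with 4 entries and 3 encoder outputs.
codebook = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
latents = np.array([[0.1, -0.1], [0.9, 0.2], [0.2, 0.8]])
indices, quantized = quantize(latents, codebook)
```

In this toy setup each latent snaps to the closest code, and the resulting integer indices play the role of discrete actions that a downstream model can predict.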

Paper Structure (18 sections, 15 equations, 4 figures, 3 tables, 1 algorithm)

Figures (4)

  • Figure 1: We propose a reachability-guided diffusion model (Left) for generating long-term human behaviors. Our model works in the discrete action space (Right). We visualize a few examples of learned discrete actions from continuous trajectory space using VQ-VAEs. Visualization is done using the SSN 3D virtual human platform birosfu.
  • Figure 2: Illustration of past horizon $H$ and future horizon $T$ in continuous trajectory space (Top). A discrete action $\mathbf{a}^\tau$ tokenizes all states in a period of $T_{vq}$ in continuous trajectory space (Bottom).
  • Figure 3: Overview of our framework. The Hierarchical Action Quantization (HAQ) encoder learns a discrete representation of human behaviors. Our diffusion policy generates 6 discrete future actions conditioned on past observations. At each step of the reverse denoising process, reachability guidance is used to enforce physical constraints. The final output is a long-term future human trajectory reconstructed from the discrete future actions using the HAQ decoder.
  • Figure 4: Comparison between different forecasting timesteps ($T = 10, 20, 30$) on SSN and JRDB datasets in terms of ADE. Our model is better at generating long-term future trajectories, while still maintaining comparable performance in short-term prediction.
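
For readers unfamiliar with the ADE and FDE metrics referenced above, the following sketch computes them for a single predicted trajectory under their standard definitions: ADE is the mean Euclidean distance over all future timesteps, and FDE is the distance at the final timestep. The function name and the toy trajectories are illustrative assumptions; the paper's exact evaluation protocol (for example, how multiple sampled futures are aggregated) is not specified here.

```python
import numpy as np

def ade_fde(pred, gt):
    """Return (ADE, FDE) for one predicted trajectory against ground truth.

    pred, gt: arrays of shape (T, 2) holding x/y positions per timestep.
    """
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    dist = np.linalg.norm(pred - gt, axis=-1)   # per-timestep Euclidean error, shape (T,)
    return dist.mean(), dist[-1]

# Toy example: the prediction drifts upward by one unit per step.
gt = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]]
pred = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]
ade, fde = ade_fde(pred, gt)
```

Here the per-step errors are 0, 1, and 2, so the average error is 1 and the final error is 2, which illustrates why FDE tends to penalize long-horizon drift more heavily than ADE.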