SocialCVAE: Predicting Pedestrian Trajectory via Interaction Conditioned Latents

Wei Xiang, Haoteng Yin, He Wang, Xiaogang Jin

TL;DR

This work tackles pedestrian trajectory prediction by balancing accuracy with interpretability. It introduces SocialCVAE, a hybrid model that blends an energy-based interaction mechanism with an interaction-conditioned CVAE to capture multimodal and socially aware motion, conditioning the CVAE on a socially explainable energy map. The coarse motion predictor, energy map, and CVAE work in a recursive framework to produce diverse future trajectories while accounting for neighbors and static/dynamic obstacles. Empirical results on ETH-UCY and SDD show state-of-the-art ADE/FDE improvements, validating the value of explicitly modeling socially influenced randomness for pedestrian motion forecasting.

Abstract

Pedestrian trajectory prediction is a key technology in many applications, providing insight into human behavior and anticipating future human motion. Most existing empirical models are formulated explicitly from observed human behaviors using explicable mathematical terms and are deterministic in nature, while recent work has focused on hybrid models that combine them with learning-based techniques to gain expressiveness while maintaining explainability. However, the deterministic nature of the steering behaviors learned from the empirical models limits the models' practical performance. To address this issue, this work proposes the social conditional variational autoencoder (SocialCVAE) for predicting pedestrian trajectories, which employs a CVAE to explore behavioral uncertainty in human motion decisions. SocialCVAE learns socially reasonable motion randomness by using a socially explainable interaction energy map as the CVAE's condition; the map illustrates the future occupancy of each pedestrian's local neighborhood area. The energy map is generated by an energy-based interaction model, which anticipates the energy cost (i.e., repulsion intensity) of pedestrians' interactions with neighbors. Experimental results on two public benchmarks comprising 25 scenes demonstrate that SocialCVAE significantly improves prediction accuracy compared with state-of-the-art methods, with up to 16.85% improvement in Average Displacement Error (ADE) and 69.18% improvement in Final Displacement Error (FDE).
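
For readers unfamiliar with the two metrics named in the abstract, the following is a minimal sketch of how ADE and FDE are commonly computed. It assumes trajectories are lists of 2D points of equal length; the function names (`ade`, `fde`, `best_of_k`) are illustrative, and the best-of-K selection rule (minimum ADE and minimum FDE taken independently over the samples) is an assumption about the evaluation protocol, not something stated in the abstract.

```python
import math

def ade(pred, truth):
    """Average Displacement Error: mean Euclidean distance over all time steps."""
    if len(pred) != len(truth) or not pred:
        raise ValueError("trajectories must be non-empty and of equal length")
    return sum(math.dist(p, g) for p, g in zip(pred, truth)) / len(pred)

def fde(pred, truth):
    """Final Displacement Error: Euclidean distance at the last time step."""
    if len(pred) != len(truth) or not pred:
        raise ValueError("trajectories must be non-empty and of equal length")
    return math.dist(pred[-1], truth[-1])

def best_of_k(samples, truth):
    """Assumed protocol: report the lowest ADE and lowest FDE among K samples."""
    return (min(ade(s, truth) for s in samples),
            min(fde(s, truth) for s in samples))

# Toy example: ground truth moves along the x-axis.
truth = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
drifting = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]   # drifts upward
exact = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]      # matches ground truth
```

With a multimodal predictor such as a CVAE, each sampled latent yields one candidate trajectory, so a best-of-K rule rewards a model whose sample set covers the true future rather than one whose average sample is close.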

Paper Structure (38 sections, 15 equations, 9 figures, 5 tables)

Figures (9)

  • Figure 1: The framework of SocialCVAE. (a) The coarse motion prediction model learns temporal motion tendencies and predicts a preferred new velocity for each pedestrian. (b) The energy-based interaction model constructs a local interaction energy map to anticipate the cost of a pedestrian's interactions with heterogeneous neighbors, including other pedestrians, static environmental obstacles found in the scene segmentation (e.g., buildings), and dynamic environmental obstacles (e.g., vehicles). (c) The multimodal prediction model predicts future trajectories using a CVAE conditioned on the past trajectories and the interaction energy map.
  • Figure 2: An illustration of the discrete candidate velocity space $V_{i}^t$ in a polar coordinate system. The blue point represents the time-extrapolated velocity $\bar{\mathbf{v}}_{i}^{t+1}$, with $r_i^t$ and $\theta_i^t$ denoting the magnitude and angle of $\bar{\mathbf{v}}_{i}^{t+1}$. The polar space is discretized into a grid with predefined cell side lengths $\Delta r$ and $\Delta \theta$ along the magnitude and angle axes. The velocity candidates in $V_{i}^t$ are the intersection points of the solid lines, centered at $\bar{\mathbf{v}}_{i}^{t+1}$ and lying within $k_r$ grid cells on the magnitude axis and $k_\theta$ grid cells on the angle axis.
  • Figure 3: An example of a pedestrian's square-shaped local interaction area. (a) The focal pedestrian at the center of the orange square interacts with heterogeneous neighbors within the square. (b) The interaction energy is recorded at the predicted local location of each neighbor (darker colors indicate higher interaction energy).
  • Figure 4: The architecture of the interaction-conditioned CVAE model. $\bigoplus$ represents the concatenation operation. Red dotted lines mark layers that are used only during training. All components of the CVAE model are built from MLPs.
  • Figure 5: Visualization results of our method. The visualized trajectories are the best predictions sampled from 20 trials. The white, green, and red dots represent the observed, ground-truth, and predicted trajectories, respectively.
  • ...and 4 more figures
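
The candidate velocity space described in the Figure 2 caption can be sketched in a few lines. This is an illustration under stated assumptions rather than the authors' implementation: the function name `candidate_velocities` is hypothetical, the grid is assumed to be symmetric around the extrapolated velocity and to include it, and candidates with a negative magnitude are assumed to be discarded.

```python
import math

def candidate_velocities(v_bar, delta_r, delta_theta, k_r, k_theta):
    """Enumerate polar-grid velocity candidates around v_bar.

    v_bar       -- extrapolated velocity (vx, vy), the grid center
    delta_r     -- cell side length on the magnitude axis
    delta_theta -- cell side length on the angle axis (radians)
    k_r, k_theta -- number of grid cells on each side of the center
    Returns a list of Cartesian (vx, vy) candidates.
    """
    vx, vy = v_bar
    r = math.hypot(vx, vy)          # magnitude of the center velocity
    theta = math.atan2(vy, vx)      # angle of the center velocity
    candidates = []
    for a in range(-k_r, k_r + 1):
        mag = r + a * delta_r
        if mag < 0:
            continue                # assumption: drop negative magnitudes
        for b in range(-k_theta, k_theta + 1):
            ang = theta + b * delta_theta
            candidates.append((mag * math.cos(ang), mag * math.sin(ang)))
    return candidates

# Example values below are arbitrary, chosen only for illustration.
grid = candidate_velocities((1.0, 0.0), delta_r=0.1,
                            delta_theta=math.pi / 12, k_r=2, k_theta=3)
```

Under these assumptions, and when no magnitude goes negative, the grid holds $(2k_r + 1)(2k_\theta + 1)$ candidates, so the example above yields 35 of them.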