Social-IWSTCNN: A Social Interaction-Weighted Spatio-Temporal Convolutional Neural Network for Pedestrian Trajectory Prediction in Urban Traffic Scenarios

Chi Zhang, Christian Berger, Marco Dozza

TL;DR

The paper addresses pedestrian trajectory prediction in urban traffic by learning data-driven social interaction weights from relative positions using a Social Interaction Extractor. It introduces Social-IWSTCNN, a model that combines spatial-social feature extraction, Temporal Convolutional Networks, and a Time-Extrapolator CNN to predict a bi-variate Gaussian distribution for each pedestrian, with parameters $(\mu_x, \mu_y, \sigma_x, \sigma_y, \rho)$. On the Waymo Open Dataset, it outperforms state-of-the-art methods such as Social-LSTM, Social-GAN, and Social-STGCNN in ADE and FDE, while delivering substantial speedups in data preprocessing (≈$54.8\times$) and total test time (≈$4.7\times$). The work demonstrates robust performance in densely populated urban scenarios and highlights future opportunities to incorporate vehicle and environment cues to further improve prediction, especially in sparser contexts.
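
To make the five output parameters concrete, here is a minimal NumPy sketch of how $(\mu_x, \mu_y, \sigma_x, \sigma_y, \rho)$ define a bivariate Gaussian over one pedestrian's future position, and how positions can be sampled from it. The function names `log_density` and `sample_positions` are illustrative assumptions and do not come from the paper's code.

```python
import numpy as np


def log_density(x, y, mu_x, mu_y, sigma_x, sigma_y, rho):
    """Log-density of the point (x, y) under a bivariate Gaussian."""
    zx = (x - mu_x) / sigma_x
    zy = (y - mu_y) / sigma_y
    one_minus_rho2 = 1.0 - rho ** 2
    # Quadratic form of the correlated 2-D Gaussian.
    quad = (zx ** 2 - 2.0 * rho * zx * zy + zy ** 2) / (2.0 * one_minus_rho2)
    # Log of the normalising constant 2*pi*sigma_x*sigma_y*sqrt(1 - rho^2).
    log_norm = np.log(2.0 * np.pi * sigma_x * sigma_y * np.sqrt(one_minus_rho2))
    return -quad - log_norm


def sample_positions(mu_x, mu_y, sigma_x, sigma_y, rho, n_samples, seed=0):
    """Draw candidate (x, y) positions from the predicted distribution."""
    cov = np.array([
        [sigma_x ** 2, rho * sigma_x * sigma_y],
        [rho * sigma_x * sigma_y, sigma_y ** 2],
    ])
    rng = np.random.default_rng(seed)  # seed is an assumption, for repeatability
    return rng.multivariate_normal([mu_x, mu_y], cov, size=n_samples)
```

In a trajectory model, one such parameter set would be predicted per pedestrian and per future time step. This summary does not state the training loss or the number of samples drawn at test time.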

Abstract

Pedestrian trajectory prediction in urban scenarios is essential for automated driving. This task is challenging because the behavior of pedestrians is influenced by both their own history paths and the interactions with others. Previous research modeled these interactions with pooling mechanisms or aggregating with hand-crafted attention weights. In this paper, we present the Social Interaction-Weighted Spatio-Temporal Convolutional Neural Network (Social-IWSTCNN), which includes both the spatial and the temporal features. We propose a novel design, namely the Social Interaction Extractor, to learn the spatial and social interaction features of pedestrians. Most previous works used ETH and UCY datasets which include five scenes but do not cover urban traffic scenarios extensively for training and evaluation. In this paper, we use the recently released large-scale Waymo Open Dataset in urban traffic scenarios, which includes 374 urban training scenes and 76 urban testing scenes to analyze the performance of our proposed algorithm in comparison to the state-of-the-art (SOTA) models. The results show that our algorithm outperforms SOTA algorithms such as Social-LSTM, Social-GAN, and Social-STGCNN on both Average Displacement Error (ADE) and Final Displacement Error (FDE). Furthermore, our Social-IWSTCNN is 54.8 times faster in data pre-processing speed, and 4.7 times faster in total test speed than the current best SOTA algorithm Social-STGCNN.
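
The two metrics named in the abstract are commonly defined as follows: ADE is the mean Euclidean distance between predicted and ground-truth positions over all predicted time steps, and FDE is the same distance at the final predicted step only. The sketch below assumes arrays of shape `(num_pedestrians, pred_len, 2)`. The function name `ade_fde` is illustrative, and details of the evaluation protocol (for example, how multiple sampled trajectories are scored) are not specified in this summary.

```python
import numpy as np


def ade_fde(pred, gt):
    """Return (ADE, FDE) for arrays of shape (num_pedestrians, pred_len, 2)."""
    # Per-pedestrian, per-step Euclidean error, shape (num_pedestrians, pred_len).
    dist = np.linalg.norm(pred - gt, axis=-1)
    ade = dist.mean()          # average over all pedestrians and time steps
    fde = dist[:, -1].mean()   # average over pedestrians at the last step only
    return float(ade), float(fde)
```

Because FDE looks only at the last step, it is typically larger than ADE when errors grow over the prediction horizon.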

Paper Structure

This paper contains 19 sections, 5 equations, 4 figures, 4 tables.

Figures (4)

  • Figure 1: Overall framework of Social-IWSTCNN. Given observed frame sequences, we use the positions in each frame as input to learn the social interaction weights, and extract spatial and social interaction features using the Social Interaction Extractor. Following this, we apply TCNs to create spatio-temporal features for each pedestrian. Then we apply Time-Extrapolator CNNs to predict future trajectory distributions. Finally, we sample to get the predicted trajectories.
  • Figure 2: Pedestrian Social Interaction Extractor. The inputs are the positions relative to the last frame and the pedestrian positions at each time step. We use an MLP to learn the social interaction weights, and use an aggregate function to extract the spatial and social interaction features.
  • Figure 3: Comparison of the prediction results of different algorithms in various scenarios. (a) Two individuals walking in parallel, (b) two individuals walking in parallel and merging, (c) collision avoidance, where the individual from the top right corner tends to avoid the pedestrians in the bottom left corner, and (d) individuals from the top left corner meeting a group from the bottom right corner. Legend: obs stands for observed paths; gt stands for the ground truth of predicted trajectories; s-lstm stands for Social-LSTM; s-gan stands for Social-GAN; s-stgcnn stands for Social-STGCNN; and s-iwstcnn stands for our proposed method, Social-IWSTCNN.
  • Figure 4: Comparison of the prediction results of different algorithms in scenarios that are difficult to predict. (a) Trajectory prediction in a densely populated scenario. Our Social-IWSTCNN manages to capture the social interactions and outperforms other methods. (b) Failure to predict an individual suddenly changing direction, (c) failure to predict an individual suddenly changing speed, and (d) failure in collision avoidance. In (b), (c), and (d), none of the algorithms predicts the trajectories correctly, because of the lack of sufficient information.
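
The Figure 2 caption describes an MLP that learns social interaction weights, followed by an aggregate function. A rough single-frame NumPy sketch of that idea is given below. Everything beyond the caption is an assumption made for illustration: the two-layer ReLU MLP, the softmax normalisation, the weighted-sum aggregation, and all names (`mlp_scores`, `social_features`, `params`). It is not the authors' implementation.

```python
import numpy as np


def mlp_scores(rel_pos, w1, b1, w2, b2):
    """Map pairwise relative positions (N, N, 2) to scalar scores (N, N)."""
    hidden = np.maximum(rel_pos @ w1 + b1, 0.0)   # ReLU layer, shape (N, N, H)
    return (hidden @ w2 + b2).squeeze(-1)          # one score per ordered pair


def social_features(positions, displacements, params):
    """Aggregate neighbours' displacements with learned interaction weights.

    positions:     (N, 2) pedestrian positions in one frame
    displacements: (N, 2) positions relative to the previous frame
    params:        (w1, b1, w2, b2), hypothetical MLP parameters
    """
    # rel_pos[i, j] is the position of pedestrian j as seen from pedestrian i.
    rel_pos = positions[None, :, :] - positions[:, None, :]
    scores = mlp_scores(rel_pos, *params)
    # Softmax over neighbours (an assumed normalisation choice).
    scores = scores - scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)
    # Weighted sum as the aggregate function (also an assumption).
    return weights @ displacements, weights
```

With randomly initialised `params` of shapes `(2, H)`, `(H,)`, `(H, 1)`, and `(1,)`, the function returns one 2-D feature per pedestrian and an `(N, N)` weight matrix whose rows sum to one. In a full model these per-frame features would then be stacked over time and passed to the temporal layers.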