Context-aware Multi-task Learning for Pedestrian Intent and Trajectory Prediction

Farzeen Munir, Tomasz Piotr Kucner

TL;DR

PTINet tackles the intertwined problem of pedestrian trajectory and crossing intention by fusing past motion with both local pedestrian attributes and global scene context in a unified multi-task framework. The architecture combines a Position-Velocity Encoding Module (LSTM-VAE), a Global Feature Module (image and optical flow via CLSTM and ResNet-50), and a Local Contextual Feature module, feeding two decoders that jointly predict future bounding boxes and crossing probabilities. Evaluations on JAAD and PIE show state-of-the-art ADE/FDE scores across multiple horizons and high F1-score and accuracy for intention, validating the advantage of jointly modeling trajectory and intention with rich contextual cues. The approach demonstrates practical potential for safer autonomous driving by enabling more accurate anticipation of pedestrian behavior in urban environments.
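
The ADE/FDE scores mentioned above can be illustrated with a minimal sketch. It assumes the common definition: ADE is the mean Euclidean distance between predicted and ground-truth positions over the prediction horizon, and FDE is that distance at the final step. This excerpt does not state the paper's exact convention, so the use of bounding-box centers, pixel units, and the `(x1, y1, x2, y2)` box layout are assumptions. The function names are hypothetical.

```python
import math


def center(box):
    """Return the (cx, cy) center of an (x1, y1, x2, y2) bounding box (assumed layout)."""
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2.0, (y1 + y2) / 2.0)


def displacement_errors(pred_boxes, true_boxes):
    """Return (ADE, FDE) for one pedestrian over one prediction horizon."""
    if len(pred_boxes) != len(true_boxes) or not pred_boxes:
        raise ValueError("need two non-empty sequences of equal length")
    # Per-step Euclidean distance between predicted and ground-truth centers.
    dists = [math.dist(center(p), center(t)) for p, t in zip(pred_boxes, true_boxes)]
    ade = sum(dists) / len(dists)  # average over all future steps
    fde = dists[-1]                # error at the last future step only
    return ade, fde


# Toy example: the prediction drifts away from a stationary ground truth.
pred = [(0, 0, 10, 20), (3, 4, 13, 24), (6, 8, 16, 28)]
true = [(0, 0, 10, 20), (0, 0, 10, 20), (0, 0, 10, 20)]
ade, fde = displacement_errors(pred, true)
print(ade, fde)  # 5.0 10.0
```

Lower values are better for both metrics. FDE isolates the end-of-horizon error, which is why the two are usually reported together for each horizon.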

Abstract

The advancement of socially-aware autonomous vehicles hinges on precise modeling of human behavior. Within this broad paradigm, the specific challenge lies in accurately predicting pedestrians' trajectories and intentions. Traditional methodologies have leaned heavily on historical trajectory data, frequently overlooking vital contextual cues such as pedestrian-specific traits and environmental factors. Furthermore, there is a notable knowledge gap, as trajectory and intention prediction have largely been approached as separate problems despite their mutual dependence. To bridge this gap, we introduce PTINet (Pedestrian Trajectory and Intention Prediction Network), which jointly learns trajectory and intention prediction by combining past trajectory observations, local contextual features (individual pedestrian behaviors), and global features (signs, markings, etc.). The efficacy of our approach is evaluated on two widely used public datasets, JAAD and PIE, where it demonstrates superior performance over existing state-of-the-art models in trajectory and intention prediction. The results from our experiments and ablation studies validate PTINet's effectiveness in jointly exploring intention and trajectory prediction for pedestrian behavior modeling. The experimental evaluation indicates the advantage of using global and local contextual features for pedestrian trajectory and intention prediction. The effectiveness of PTINet in predicting pedestrian behavior paves the way for the development of automated systems capable of seamlessly interacting with pedestrians in urban settings.

Paper Structure

This paper contains 21 sections, 9 equations, 6 figures, 4 tables.

Figures (6)

  • Figure 1: A context-aware multi-task learning framework for predicting pedestrian trajectories and intentions. The architecture comprises a Global Feature Module, which processes image and optical flow data using a CLSTM module and ResNet-50, respectively. Concurrently, the Local Contextual Module takes in local contextual features and employs a combination of MLP and LSTM-VAE blocks for feature extraction. The Position-Velocity Encoding Module encodes past pedestrian trajectories. The outputs of these modules are concatenated and fed into separate trajectory and intention decoders, which produce the predictions.
  • Figure 2: The CLSTM module, designed to process input images and generate the global features (GF). The module comprises three CLSTM blocks, each followed by max pooling. The last block incorporates a max pooling layer followed by a fully connected layer.
  • Figure 3: The architecture of the LSTM-VAE module employed in PTINet, which is used to learn the local contextual features (LCF) and to capture the temporal representation of past trajectories.
  • Figure 4: Qualitative results of the proposed framework on the JAAD, PIE, and TITAN datasets. Red bounding boxes indicate predictions at the current timestamp, while white bounding boxes represent ground truth values. Dotted lines illustrate predicted trajectories over a 0.5s time horizon, with blue indicating ground truth and red showing predicted values. The bar graph displays the pedestrian’s intentions, providing a comprehensive view of the model’s performance.
  • Figure 5: Bar plots show the comparative performance of ADE and FDE scores for PTINet, PTINet without image data, and PTINet without optical flow on the JAAD, PIE, and TITAN datasets.
  • ...and 1 more figure
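
The CLSTM module in the Figure 2 caption (three CLSTM blocks, each followed by max pooling, then a fully connected layer) can be traced as a rough shape calculation. This is only a sketch under stated assumptions. The captions do not give the input resolution, the pooling size, or whether the CLSTM layers preserve spatial size, so the 224×224 input, the 2×2 pooling with stride 2, and the shape-preserving recurrent layers are all hypothetical choices, as is the function name.

```python
def trace_clstm_shapes(height, width, num_blocks=3, pool=2):
    """Track spatial size through blocks of (CLSTM layer + max pooling).

    Assumptions (not taken from the paper): each CLSTM layer keeps H x W
    unchanged, and each max pooling layer uses a pool x pool window with
    stride equal to pool, so each block floor-divides H and W by pool.
    """
    shapes = [(height, width)]
    for _ in range(num_blocks):
        height, width = height // pool, width // pool
        shapes.append((height, width))
    return shapes


# Hypothetical 224 x 224 input frames.
shapes = trace_clstm_shapes(224, 224)
print(shapes)  # [(224, 224), (112, 112), (56, 56), (28, 28)]
```

Under these assumptions, the fully connected layer after the last pooling step would receive a 28×28 feature map per channel, flattened into a single global feature vector.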