Snapshot: Towards Application-centered Models for Pedestrian Trajectory Prediction in Urban Traffic Environments

Nico Uhlemann; Yipeng Zhou; Tobias Simeon Mohr; Markus Lienkamp

TL;DR

The paper tackles real-world pedestrian trajectory prediction in urban traffic by introducing Snapshot, a compact, unimodal, feed-forward predictor with two dedicated encoders for social and map information. It couples a novel, agent-centric encoding with a cross-attention map encoder and a CNN-based decoder to forecast up to $T_p=60$ timesteps from a short observation horizon $T_o=10$, while maintaining real-time performance. A dedicated Argoverse 2 pedestrian benchmark is proposed, derived from a large-scale dataset and augmented via sliding-window sampling to produce over $10^6$ training/validation samples; Snapshot achieves an ADE improvement of $8.8\%$ over state-of-the-art baselines and strong robustness to varying histories. Real-world applicability is demonstrated by integrating Snapshot into an autonomous driving stack, showing reliable predictions under noisy detections and confirming the model's suitability for real-time deployment in urban environments.
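The sliding-window sampling mentioned above can be illustrated with a minimal sketch. It assumes a single track stored as a `(T, 2)` array of positions; the function name `sliding_windows` and the `stride` parameter are hypothetical and not taken from the paper, which may sample windows differently.

```python
import numpy as np

T_O, T_P = 10, 60  # observation and prediction horizons stated in the text


def sliding_windows(track, stride=1):
    """Split a (T, 2) track into (observation, future) pairs.

    Each window spans T_O + T_P = 70 timesteps. The stride is an
    illustrative assumption, not a value taken from the paper.
    """
    window = T_O + T_P
    samples = []
    for start in range(0, len(track) - window + 1, stride):
        chunk = track[start:start + window]
        samples.append((chunk[:T_O], chunk[T_O:]))
    return samples


# Toy track: 100 timesteps moving along the x-axis.
track = np.stack([np.arange(100.0), np.zeros(100)], axis=1)
samples = sliding_windows(track, stride=10)
```

Tracks shorter than 70 timesteps yield no samples in this sketch, and overlapping windows are what multiply the number of samples obtained from one recorded scenario.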

Abstract

This paper explores pedestrian trajectory prediction in urban traffic while focusing on both model accuracy and real-world applicability. While promising approaches exist, they often revolve around pedestrian datasets excluding traffic-related information, or resemble architectures that are either not real-time capable or not robust. To address these limitations, we first introduce a dedicated benchmark based on Argoverse 2, specifically targeting pedestrians in traffic environments. Following this, we present Snapshot, a modular, feed-forward neural network that outperforms the current state of the art, reducing the Average Displacement Error (ADE) by 8.8% while utilizing significantly less information. Despite its agent-centric encoding scheme, Snapshot demonstrates scalability, real-time performance, and robustness to varying motion histories. Moreover, by integrating Snapshot into a modular autonomous driving software stack, we showcase its real-world applicability.
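For readers unfamiliar with the metric, ADE is commonly computed as the mean Euclidean distance between predicted and ground-truth positions over the prediction horizon. The sketch below uses that common definition; the function name is hypothetical, and the paper's Metrics section may specify details (such as averaging over agents) that are not reproduced here.

```python
import numpy as np


def average_displacement_error(pred, truth):
    """Mean Euclidean distance between two (T, 2) trajectories."""
    pred = np.asarray(pred, dtype=float)
    truth = np.asarray(truth, dtype=float)
    # Per-timestep displacement, then the average over the horizon.
    return float(np.linalg.norm(pred - truth, axis=-1).mean())


# Toy example: a prediction offset by (3, 4) at each of 60 timesteps.
truth = np.zeros((60, 2))
pred = truth + np.array([3.0, 4.0])
ade = average_displacement_error(pred, truth)
```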

Paper Structure (20 sections, 3 equations, 9 figures, 3 tables)

Introduction
Related Work
Pedestrian Datasets
Prediction Approaches
Incorporated features
Methodology
Problem formulation
Dataset and benchmark
Feature representation
Model architecture
Training procedure
Metrics
Results
Training strategy
Quantitative results
...and 5 more sections

Figures (9)

Figure 1: Comparison showcasing a pedestrian-only scenario on the left in contrast to a traffic environment scene on the right. In both cases, pedestrians are highlighted with red circles and black heading arrows.
Figure 2: Sampling process performed through a sliding window approach, spanning 70 timesteps as visualized by the red bounding box. Pedestrian 1 (Ped 1) is marked as SCORED_TRACK as indicated by the orange color, while the remaining tracks are only employed during observation. Here, Vehicle 1 (Veh 1) is marked as TRACK_FRAGMENT and pedestrian 2 (Ped 2) as UNSCORED_TRACK.
Figure 3: Interaction features derived from crowd research. The scene shows a vehicle and the focal pedestrian moving towards one another. Given a constant velocity assumption, the time- and distance-to-closest-approach as well as the derivative of the bearing angle $\dot{\alpha}$ are calculated and used to determine the collision risk $col_{risk}$ as defined by the displayed equation.
Figure 4: Vectorized, local map with a radius $r$ of 20 m centered around the focal pedestrian. Each polygon, represented by lane segment A and crosswalk B, comprises individual polylines labeled with small letters. For the input, each is transformed into a feature vector as shown on the right, where the first two entries indicate semantic type and polygon id, while the remaining four define the start- and endpoint coordinates.
Figure 5: Overview of the proposed Snapshot architecture, featuring two independent encoders for social and map information. The subsequent trajectory decoder fuses the information to produce a unimodal output.
...and 4 more figures
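
The time- and distance-to-closest-approach named in the Figure 3 caption follow from standard constant-velocity kinematics. The sketch below shows that computation only; the function name and the clamping of the time to non-negative values are assumptions, and the paper's collision risk $col_{risk}$ and bearing-angle derivative $\dot{\alpha}$ are not reproduced because their equation is not shown here.

```python
import numpy as np


def closest_approach(p_ped, v_ped, p_veh, v_veh):
    """Return (time, distance) of closest approach for two agents
    that both keep their current velocity."""
    dp = np.asarray(p_veh, dtype=float) - np.asarray(p_ped, dtype=float)
    dv = np.asarray(v_veh, dtype=float) - np.asarray(v_ped, dtype=float)
    speed_sq = float(dv @ dv)
    if speed_sq < 1e-12:
        # No relative motion: the distance never changes.
        t = 0.0
    else:
        # Minimizer of |dp + t * dv|, clamped to the future (assumption).
        t = max(0.0, -float(dp @ dv) / speed_sq)
    d = float(np.linalg.norm(dp + t * dv))
    return t, d


# Pedestrian walks in +x at 1 m/s; vehicle approaches in -x at 1 m/s,
# offset laterally by 1 m.
t_ca, d_ca = closest_approach([0, 0], [1, 0], [10, 1], [-1, 0])
```

In this toy scene the relative velocity is 2 m/s along x, so the agents are closest after 5 s at a lateral distance of 1 m.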
