Synthetic Data Generation Framework, Dataset, and Efficient Deep Model for Pedestrian Intention Prediction
Muhammad Naveed Riaz, Maciej Wielgosz, Abel Garcia Romera, Antonio M. Lopez
TL;DR
Problem: predicting pedestrian crossing intention (C/NC) for safe autonomous driving when data covering diverse scenes is scarce. Approach: ARCANE, built on CARLA, programmatically generates synthetic C/NC video clips to form PedSynth (~947 clips, ~398K frames); PedGNN, a lightweight predictor, processes sequences of $N_F$ frames of 19-joint skeleton graphs with a graph-convolutional gated recurrent unit to output C/NC. Key findings: PedSynth complements JAAD and PIE for training; PedGNN achieves strong F1 with a tiny memory footprint (~27 KB) and fast inference (~0.6 ms on a GTX 1080); synthetic data can also serve as testing data. Significance: the framework enables diverse, controllable training data and efficient onboard C/NC prediction, with promising directions for synth-to-real unsupervised domain adaptation.
Abstract
Pedestrian intention prediction is crucial for autonomous driving. In particular, knowing whether pedestrians are going to cross in front of the ego-vehicle is core to performing safe and comfortable maneuvers. Creating accurate and fast models that predict such intentions from sequential images is challenging. One contributing factor is the lack of datasets with diverse crossing and non-crossing (C/NC) scenarios. We address this scarcity by introducing a framework, named ARCANE, which allows programmatically generating synthetic datasets consisting of C/NC video clip samples. As an example, we use ARCANE to generate a large and diverse dataset named PedSynth. We show how PedSynth complements widely used real-world datasets such as JAAD and PIE, thereby enabling more accurate models for C/NC prediction. Considering the onboard deployment of C/NC prediction models, we also propose a deep model named PedGNN, which is fast and has a very low memory footprint. PedGNN is based on a GNN-GRU architecture that takes a sequence of pedestrian skeletons as input to predict crossing intentions.
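To make the GNN-GRU idea concrete, the following is a minimal NumPy sketch of a graph-convolutional GRU applied to a sequence of 19-joint skeleton graphs. It is not the authors' implementation: the chain-shaped skeleton edges, the three input features per joint, the hidden size of 16, the clip length of 8 frames, the mean-pooling readout, and all function names are illustrative assumptions, and the weights are random rather than trained.

```python
import numpy as np

NUM_JOINTS = 19  # joints per skeleton graph (from the text)
IN_FEATS = 3     # assumption: (x, y, confidence) per joint
HIDDEN = 16      # assumption: hidden units per joint


def normalized_adjacency(edges, n):
    """Symmetric normalization D^-1/2 (A + I) D^-1/2 of an undirected graph."""
    a = np.eye(n)
    for i, j in edges:
        a[i, j] = a[j, i] = 1.0
    d_inv_sqrt = 1.0 / np.sqrt(a.sum(axis=1))
    return a * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def init_params(rng, in_feats=IN_FEATS, hidden=HIDDEN):
    """Random (untrained) weights for the update, reset, and candidate gates."""
    p = {}
    for gate in ("z", "r", "h"):
        p["Wx" + gate] = rng.normal(0.0, 0.1, size=(in_feats, hidden))
        p["Wh" + gate] = rng.normal(0.0, 0.1, size=(hidden, hidden))
        p["b" + gate] = np.zeros(hidden)
    p["Wout"] = rng.normal(0.0, 0.1, size=(hidden, 2))  # two classes: C / NC
    p["bout"] = np.zeros(2)
    return p


def gconv_gru_step(a_hat, x, h, p):
    """One GRU step in which every linear map is a one-hop graph convolution."""
    def gc(feats, weight):
        return a_hat @ feats @ weight

    z = sigmoid(gc(x, p["Wxz"]) + gc(h, p["Whz"]) + p["bz"])  # update gate
    r = sigmoid(gc(x, p["Wxr"]) + gc(h, p["Whr"]) + p["br"])  # reset gate
    h_tilde = np.tanh(gc(x, p["Wxh"]) + gc(r * h, p["Whh"]) + p["bh"])
    return z * h + (1.0 - z) * h_tilde


def predict_crossing(seq, a_hat, p):
    """seq has shape (N_F, joints, features); returns class probabilities."""
    h = np.zeros((seq.shape[1], p["Whz"].shape[0]))
    for x in seq:
        h = gconv_gru_step(a_hat, x, h, p)
    logits = h.mean(axis=0) @ p["Wout"] + p["bout"]  # mean-pool over joints
    e = np.exp(logits - logits.max())
    return e / e.sum()


rng = np.random.default_rng(0)
# Placeholder chain skeleton; a real skeleton has a different edge list.
edges = [(i, i + 1) for i in range(NUM_JOINTS - 1)]
a_hat = normalized_adjacency(edges, NUM_JOINTS)
params = init_params(rng)
clip = rng.normal(size=(8, NUM_JOINTS, IN_FEATS))  # assumption: N_F = 8
probs = predict_crossing(clip, a_hat, params)
num_params = sum(v.size for v in params.values())
print(probs, num_params)
```

With these assumed sizes the sketch has 994 parameters, about 4 KB in float32. That count describes only this toy configuration and says nothing about the actual PedGNN layout, but it illustrates why a skeleton-based recurrent model of this kind can plausibly fit in a memory budget of a few tens of kilobytes.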
