Synthetic Data Generation Framework, Dataset, and Efficient Deep Model for Pedestrian Intention Prediction

Muhammad Naveed Riaz, Maciej Wielgosz, Abel Garcia Romera, Antonio M. Lopez

TL;DR

Problem: predicting pedestrian crossing intention (crossing/non-crossing, C/NC) for safe autonomous driving when datasets covering diverse scenes are scarce. Approach: ARCANE, built on CARLA, programmatically generates synthetic C/NC video clips to form PedSynth (~947 clips, ~398K frames); PedGNN, a lightweight predictor, processes sequences of $N_F$ frames of 19-joint skeleton graphs with a graph-convolutional gated recurrent unit and outputs C/NC. Key findings: PedSynth complements JAAD and PIE for training; PedGNN achieves strong F1 with a tiny memory footprint (~27 KB) and fast inference (~0.6 ms on a GTX 1080); synthetic data can also serve as testing data. Significance: the framework enables diverse, controllable training data for onboard models and efficient C/NC prediction, with promising directions for synth-to-real unsupervised domain adaptation.

Abstract

Pedestrian intention prediction is crucial for autonomous driving. In particular, knowing whether pedestrians are going to cross in front of the ego-vehicle is core to performing safe and comfortable maneuvers. Creating accurate and fast models that predict such intentions from sequential images is challenging. One contributing factor is the lack of datasets with diverse crossing and non-crossing (C/NC) scenarios. We address this scarcity by introducing a framework, named ARCANE, which allows programmatically generating synthetic datasets consisting of C/NC video clip samples. As an example, we use ARCANE to generate a large and diverse dataset named PedSynth. We show how PedSynth complements widely used real-world datasets such as JAAD and PIE, thereby enabling more accurate models for C/NC prediction. Considering the onboard deployment of C/NC prediction models, we also propose a deep model named PedGNN, which is fast and has a very low memory footprint. PedGNN is based on a GNN-GRU architecture that takes a sequence of pedestrian skeletons as input to predict crossing intentions.

Paper Structure (13 sections, 5 figures, 7 tables)

Figures (5)

  • Figure 1: Summary of two video clips from PedSynth. Top rows: a pedestrian crosses the road perpendicularly to the ego-vehicle's direction of motion. Bottom rows: a pedestrian changes the intention of crossing the road at mid-lane. In both examples, the pedestrians enter the road at locations not designated for crossing.
  • Figure 2: To perform C/NC predictions, PedGNN processes sequences of pedestrian skeletons. To process onboard sequences while driving, we use a temporal sliding window with a 1-frame step. PedGNN consists of a graph convolutional gated recurrent unit (GConvGRU), followed by a block of three (ReLU + fully connected) layers and a final Softmax. Synthetic datasets with C/NC examples can be used for training PedGNN. For instance, in this paper we use PedSynth, a synthetic dataset that we have generated using ARCANE, a framework that we also introduce in this paper (see Fig. 3).
  • Figure 3: Block diagram of ARCANE dataset generator.
  • Figure 4: Pedestrian skeleton as expected by PedGNN. We consider 19 joints connected as an undirected graph.
  • Figure 5: Performance of PedGNN trained on JAAD+PedSynth and tested on JAAD. Cases (a) and (b) are fully successful, while in cases (c) and (d) there are C/NC prediction discrepancies with the labels provided by human labelers (GT). Time in each sequence runs from top-left to bottom-right.
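
The captions for Figures 2 and 4 describe two input-side mechanics: a temporal sliding window with a 1-frame step, and a 19-joint skeleton treated as an undirected graph. The following is a minimal sketch of both, not the authors' implementation. The function names `sliding_windows` and `undirected_adjacency` are hypothetical, and the edge list is a placeholder chain chosen only for illustration, since the actual joint connectivity is not given in the text above.

```python
from typing import Iterator, List, Sequence, Tuple

NUM_JOINTS = 19  # joint count stated in the Figure 4 caption


def sliding_windows(frames: Sequence, n_f: int, step: int = 1) -> Iterator[Sequence]:
    """Yield consecutive windows of n_f frames, advancing `step` frames each time.

    With step=1 this mirrors the 1-frame-step window mentioned for Figure 2.
    Sequences shorter than n_f yield no windows.
    """
    if n_f <= 0 or step <= 0:
        raise ValueError("n_f and step must be positive")
    for start in range(0, len(frames) - n_f + 1, step):
        yield frames[start:start + n_f]


def undirected_adjacency(num_nodes: int, edges: List[Tuple[int, int]]) -> List[List[int]]:
    """Build a symmetric 0/1 adjacency matrix from an edge list."""
    adj = [[0] * num_nodes for _ in range(num_nodes)]
    for i, j in edges:
        adj[i][j] = 1
        adj[j][i] = 1  # undirected: mirror every edge
    return adj


# ASSUMPTION: placeholder topology (a simple chain 0-1-...-18).
# This is NOT the skeleton connectivity used by PedGNN.
PLACEHOLDER_EDGES = [(i, i + 1) for i in range(NUM_JOINTS - 1)]

if __name__ == "__main__":
    frames = list(range(10))  # stand-ins for per-frame skeletons
    windows = list(sliding_windows(frames, n_f=4))
    print(len(windows), windows[0], windows[-1])

    adj = undirected_adjacency(NUM_JOINTS, PLACEHOLDER_EDGES)
    print(sum(map(sum, adj)))  # twice the number of edges
```

In an onboard setting, each yielded window would be the input to one C/NC prediction, so a new prediction becomes available at every frame once the first $N_F$ frames have been observed.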