Diving Deeper Into Pedestrian Behavior Understanding: Intention Estimation, Action Prediction, and Event Risk Assessment

Amir Rasouli, Iuliia Kotseruba

TL;DR

This work dissects pedestrian behavior understanding into three distinct tasks: intention estimation, action prediction, and event risk assessment. It introduces a benchmark based on JAAD and PIE, along with new per-task metrics and per-instance analyses to capture timing, consistency, and risk impact. By evaluating four SOTA models across tasks and input modalities, it reveals task-specific effects of context, highlights limited cross-task agreement, and stresses the need for temporally stable, multimodal approaches. The findings inform safer autonomous driving by emphasizing task-specific evaluation and the complementary roles of intention, action, and risk assessment.

Abstract

In this paper, we delve into the pedestrian behavior understanding problem from the perspective of three different tasks: intention estimation, action prediction, and event risk assessment. We first define the tasks and discuss how these tasks are represented and annotated in two widely used pedestrian datasets, JAAD and PIE. We then propose a new benchmark based on these definitions, available annotations, and three new classes of metrics, each designed to assess different aspects of the model performance. We apply the new evaluation approach to examine four SOTA prediction models on each task and compare their performance w.r.t. metrics and input modalities. In particular, we analyze the differences between intention estimation and action prediction tasks by considering various scenarios and contextual factors. Lastly, we examine model agreement across these two tasks to show their complementary role. The proposed benchmark reveals new facts about the role of different data modalities, the tasks, and relevant data properties. We conclude by elaborating on our findings and proposing future research directions.

Paper Structure (28 sections, 3 equations, 5 figures, 6 tables)

Figures (5)

  • Figure 1: An overview of different tasks of pedestrian behavior understanding. Top: connections between different tasks, either definite (solid arrows) or probable (dashed arrows). Bottom: examples of pedestrians with different types of behavior and associated risks.
  • Figure 2: Overview of annotations and sampling in PIE. Intention labels are represented by aggregated votes of human observers who watched videos of pedestrians from experiment start up to the critical point. Action labels are based on the observed action of crossing in front of the ego-vehicle. Sequences for the action prediction task are sampled so that the observations end at a time-to-event (TTE) between $1$ and $3$ s. Observation start is the earliest frame that is fed to the model.
  • Figure 3: Example of event risk regions overlaid on the view from the ego-vehicle. Colors from red to green represent the associated risk from highest to lowest, respectively.
  • Figure 4: Per-instance metric example for binary action prediction. GT refers to ground truth. The soft label is computed by averaging the prediction confidence over all samples of the instance. If the prediction for at least one sample disagrees with the rest, the hard label is set to a label other than the ground truth, i.e. the instance is treated as a misprediction. If the predicted labels for all samples are the same, the hard label is that shared label.
  • Figure 5: Per-class average precision of models for the event risk assessment task. The background color in each graph represents risk associated with each region.
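
The per-instance labels described in the Figure 4 caption can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `per_instance_labels`, the `InstanceLabels` container, the 0.5 decision threshold, and the convention that each confidence is the predicted probability of the positive class are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class InstanceLabels:
    """Hypothetical container for the two per-instance labels."""
    soft: float  # mean positive-class confidence over all samples
    hard: int    # single binary label assigned to the whole instance


def per_instance_labels(
    confidences: Sequence[float],
    gt: int,
    threshold: float = 0.5,  # assumed decision threshold, not from the paper
) -> InstanceLabels:
    """Aggregate per-sample predictions of one pedestrian instance.

    confidences: assumed to be the predicted probability of the positive
                 class for every sample drawn from the same instance.
    gt:          binary ground-truth label (0 or 1) of the instance.
    """
    if len(confidences) == 0:
        raise ValueError("an instance needs at least one sample")
    if gt not in (0, 1):
        raise ValueError("gt must be 0 or 1 for binary prediction")

    # Soft label: average of the prediction confidences over all samples.
    soft = sum(confidences) / len(confidences)

    # Binarize each sample prediction with the assumed threshold.
    predicted = [int(c >= threshold) for c in confidences]

    if len(set(predicted)) == 1:
        # All samples agree: the hard label is the shared prediction,
        # which may or may not match the ground truth.
        hard = predicted[0]
    else:
        # At least one sample disagrees with the rest: count the instance
        # as a misprediction by assigning the label opposite to gt.
        hard = 1 - gt

    return InstanceLabels(soft=soft, hard=hard)


if __name__ == "__main__":
    # Consistent predictions on a positive instance.
    print(per_instance_labels([0.9, 0.8, 0.7], gt=1))
    # One dissenting sample turns the instance into a misprediction.
    print(per_instance_labels([0.9, 0.2, 0.7], gt=1))
```

Under this reading, the hard label penalizes temporal inconsistency: an instance counts as correct only when every sample is predicted correctly, whereas the soft label still reflects how confident the model was on average.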