CRITERIA: a New Benchmarking Paradigm for Evaluating Trajectory Prediction Models for Autonomous Driving

Changhe Chen, Mozhgan Pourkeshavarz, Amir Rasouli

TL;DR

CRITERIA introduces a scenario-aware benchmarking paradigm for trajectory prediction in autonomous driving, addressing biases in traditional benchmarks that favor common cruising scenarios and length-biased metrics. By extracting driving scenarios along road structure, model performance, and data properties, and by proposing bias-free diversity (AAE, AMV) and admissibility (ATT) metrics, the paper enables more nuanced model ranking and behavior characterization. Extensive experiments on Argoverse show that top-accuracy models need not be top in diversity or admissibility, and that balanced models (e.g., MMTransformer, LaneGCN) provide more robust performance across criteria. The framework offers a practical tool for comparing models under varied real-world conditions and guides future architectural improvements and evaluation protocols.

Abstract

Benchmarking is a common method for evaluating trajectory prediction models for autonomous driving. Existing benchmarks rely on datasets that are biased towards more common scenarios, such as cruising, and on distance-based metrics that are computed by averaging over all scenarios. Following such a regimen provides little insight into the properties of the models, both in terms of how well they can handle different scenarios and how admissible and diverse their outputs are. There exist a number of complementary metrics designed to measure the admissibility and diversity of trajectories; however, they suffer from biases, such as the length of trajectories. In this paper, we propose a new benChmarking paRadIgm for evaluaTing trajEctoRy predIction Approaches (CRITERIA). Specifically, we 1) propose a method for extracting driving scenarios at varying levels of specificity according to the structure of the roads, models' performance, and data properties for fine-grained ranking of prediction models; 2) propose a set of new bias-free metrics for measuring diversity, by incorporating the characteristics of a given scenario, and admissibility, by considering the structure of roads and kinematic compliance, motivated by real-world driving constraints; and 3) conduct extensive experiments with the proposed benchmark on a representative set of prediction models using the large-scale Argoverse dataset. We show that the proposed benchmark can produce a more accurate ranking of the models and serve as a means of characterizing their behavior. We further present ablation studies to highlight the contributions of the different elements used to compute the proposed metrics.

Paper Structure (27 sections, 3 equations, 3 figures, 4 tables)

Figures (3)

  • Figure 1: Examples of two scenarios with two different prediction sets where diversity and admissibility of the trajectories are not reflected in minimum final displacement error (minFDE). In the top row, the minFDEs of the two predictions are the same, even though the red trajectories in the right prediction are inadmissible. In the bottom row, the minFDE of both cases is also the same, but the right example has a more diverse set of trajectories.
  • Figure 2: Performance comparison on the challenging scenarios in terms of diversity, admissibility, and minFDE (represented as the size of circles, so smaller is better).
  • Figure 3: Qualitative samples of different ATT test scores for predicted trajectories. The red lines indicate inadmissible trajectories. From left to right, the samples are generated by TNT, MMTransformer, TNT and LaneGCN.
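
The point made in the Figure 1 caption, that minFDE can be identical for prediction sets of very different diversity, can be illustrated with a small numerical sketch. The code below assumes the standard definition of minFDE (the smallest Euclidean distance between any predicted endpoint and the ground-truth endpoint). The trajectories are made-up toy data, and `endpoint_spread` is a hypothetical stand-in for a diversity measure used only for illustration; it is not the paper's AAE or AMV metric.

```python
import numpy as np


def min_fde(predictions, ground_truth):
    """Minimum final displacement error.

    predictions: array of shape (K, T, 2), K candidate trajectories.
    ground_truth: array of shape (T, 2).
    """
    final_errors = np.linalg.norm(predictions[:, -1, :] - ground_truth[-1], axis=1)
    return float(final_errors.min())


def endpoint_spread(predictions):
    """Hypothetical diversity proxy: mean pairwise distance between endpoints.

    This is NOT the paper's AAE/AMV; it only shows that diversity can differ
    while minFDE stays the same.
    """
    ends = predictions[:, -1, :]
    dists = np.linalg.norm(ends[:, None, :] - ends[None, :, :], axis=-1)
    upper = np.triu_indices(len(ends), k=1)  # each unordered pair once
    return float(dists[upper].mean())


def straight_to(end, steps=5):
    """Toy trajectory: a straight line from the origin to `end`."""
    return np.stack(
        [np.linspace(0.0, end[0], steps), np.linspace(0.0, end[1], steps)], axis=1
    )


# Toy ground truth: the agent drives straight ahead to (10, 0).
gt = straight_to((10.0, 0.0))

# Set A: three nearly identical predictions (low diversity).
set_a = np.stack([straight_to(e) for e in [(9.0, 0.0), (9.0, 0.1), (9.0, -0.1)]])

# Set B: same best prediction, but the other modes fan out (high diversity).
set_b = np.stack([straight_to(e) for e in [(9.0, 0.0), (6.0, 6.0), (6.0, -6.0)]])

fde_a, fde_b = min_fde(set_a, gt), min_fde(set_b, gt)
spread_a, spread_b = endpoint_spread(set_a), endpoint_spread(set_b)

print(f"minFDE  A={fde_a:.3f}  B={fde_b:.3f}")
print(f"spread  A={spread_a:.3f}  B={spread_b:.3f}")
```

Both sets share the same best endpoint, so their minFDE values coincide, while the spread of set B is much larger. A ranking based on minFDE alone cannot tell the two sets apart, which is the gap the complementary diversity and admissibility metrics are meant to address.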