CRITERIA: a New Benchmarking Paradigm for Evaluating Trajectory Prediction Models for Autonomous Driving
Changhe Chen, Mozhgan Pourkeshavarz, Amir Rasouli
TL;DR
CRITERIA introduces a scenario-aware benchmarking paradigm for trajectory prediction in autonomous driving, addressing two biases in traditional benchmarks: datasets that favor common cruising scenarios and metrics that are biased by trajectory length. By extracting driving scenarios according to road structure, model performance, and data properties, and by proposing bias-free diversity (AAE, AMV) and admissibility (ATT) metrics, the paper enables more nuanced model ranking and behavior characterization. Extensive experiments on Argoverse show that the most accurate models are not necessarily the best in diversity or admissibility, and that balanced models (e.g., MMTransformer, LaneGCN) provide more robust performance across criteria. The framework offers a practical tool for comparing models under varied real-world conditions and guides future architectural improvements and evaluation protocols.
Abstract
Benchmarking is a common method for evaluating trajectory prediction models for autonomous driving. Existing benchmarks rely on datasets that are biased towards more common scenarios, such as cruising, and on distance-based metrics that are computed by averaging over all scenarios. Following such a regimen provides little insight into the properties of the models, both in terms of how well they can handle different scenarios and how admissible and diverse their outputs are. There exist a number of complementary metrics designed to measure the admissibility and diversity of trajectories; however, they suffer from biases, such as the length of trajectories. In this paper, we propose a new benChmarking paRadIgm for evaluaTing trajEctoRy predIction Approaches (CRITERIA). In particular, we 1) propose a method for extracting driving scenarios at varying levels of specificity according to the structure of the roads, models' performance, and data properties for fine-grained ranking of prediction models; 2) propose a set of new bias-free metrics for measuring diversity, by incorporating the characteristics of a given scenario, and admissibility, by considering the structure of roads and kinematic compliance, motivated by real-world driving constraints; and 3) using the proposed benchmark, conduct extensive experimentation on a representative set of prediction models using the large-scale Argoverse dataset. We show that the proposed benchmark can produce a more accurate ranking of the models and serve as a means of characterizing their behavior. We further present ablation studies to highlight the contributions of the different elements that are used to compute the proposed metrics.
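The effect of averaging a distance-based metric over all scenarios can be illustrated with a minimal sketch. It assumes average displacement error (ADE) as the distance-based metric and uses made-up toy trajectories and scenario labels; the function names `ade` and `summarize` are hypothetical and are not part of the CRITERIA benchmark.

```python
from collections import defaultdict
import math


def ade(pred, gt):
    """Average displacement error: mean Euclidean distance over time steps."""
    return sum(math.dist(p, g) for p, g in zip(pred, gt)) / len(gt)


def summarize(samples):
    """Return (overall mean ADE, per-scenario mean ADE).

    samples: list of (scenario_label, predicted_points, ground_truth_points).
    """
    errors = defaultdict(list)
    for label, pred, gt in samples:
        errors[label].append(ade(pred, gt))
    all_errors = [e for errs in errors.values() for e in errs]
    overall = sum(all_errors) / len(all_errors)
    per_scenario = {label: sum(errs) / len(errs) for label, errs in errors.items()}
    return overall, per_scenario


# Toy data (illustrative only): nine easy cruising samples, one hard turning sample.
gt = [(0.0, 0.0), (1.0, 0.0)]
cruise_pred = [(0.0, 0.0), (1.0, 0.0)]  # matches the ground truth exactly
turn_pred = [(0.0, 3.0), (1.0, 3.0)]    # offset by 3 units at every step
samples = [("cruising", cruise_pred, gt)] * 9 + [("turning", turn_pred, gt)]

overall, per_scenario = summarize(samples)
print(overall)        # mean over all samples, dominated by cruising
print(per_scenario)   # per-scenario means expose the turning error
```

In this toy set the overall mean is pulled down by the many cruising samples, while the per-scenario breakdown shows that the error in the rare turning scenario is ten times the overall average.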
