Scalable Offline Metrics for Autonomous Driving

Animikh Aich, Adwait Kulkarni, Eshed Ohn-Bar

TL;DR

The paper tackles the gap between offline evaluation of vision-based autonomous driving policies and their online safety performance. It introduces Uncertainty-Weighted Error (UWE), an offline metric that uses MC dropout to estimate epistemic uncertainty and weight multiple offline error terms, aiming to better predict online outcomes such as Driving Score. Through extensive CARLA simulations and a small-scale real-world testbed, UWE shows stronger correlation with online metrics than traditional offline measures, with improvements around 13% in simulation and robustness across settings; ensemble-based uncertainty further boosts reliability. The findings advocate for uncertainty-aware, scalable offline evaluation as a practical proxy for online driving safety, while acknowledging limitations in scaling to full urban environments and calling for broader validation.

Abstract

Real-world evaluation of perception-based planning models for robotic systems, such as autonomous vehicles, can be safely and inexpensively conducted offline, i.e. by computing model prediction error over a pre-collected validation dataset with ground-truth annotations. However, extrapolating from offline model performance to online settings remains a challenge. In these settings, seemingly minor errors can compound and result in test-time infractions or collisions. This relationship is understudied, particularly across diverse closed-loop metrics and complex urban maneuvers. In this work, we revisit this undervalued question in policy evaluation through an extensive set of experiments across diverse conditions and metrics. Based on analysis in simulation, we find an even worse correlation between offline and online settings than reported by prior studies, casting doubts on the validity of current evaluation practices and metrics for driving policies. Next, we bridge the gap between offline and online evaluation. We investigate an offline metric based on epistemic uncertainty, which aims to capture events that are likely to cause errors in closed-loop settings. The resulting metric achieves over 13% improvement in correlation compared to previous offline metrics. We further validate the generalization of our findings beyond the simulation environment in real-world settings, where even greater gains are observed.
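
The idea of weighting offline error by epistemic uncertainty can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's exact formulation: the toy linear policy, the use of predictive variance across stochastic forward passes as the uncertainty estimate, the normalization of weights, and the names `mc_dropout_predictions` and `uncertainty_weighted_error` are all hypothetical choices made here for clarity.

```python
import numpy as np


def mc_dropout_predictions(features, weights, drop_rate, n_passes, rng):
    """Run n_passes stochastic forward passes of a toy linear policy.

    Dropout stays active at test time (MC dropout), so each pass sees a
    different random mask. The linear model is an assumption of this sketch.
    """
    if not 0.0 <= drop_rate < 1.0:
        raise ValueError("drop_rate must be in [0, 1)")
    preds = []
    for _ in range(n_passes):
        mask = rng.random(features.shape) >= drop_rate
        dropped = features * mask / (1.0 - drop_rate)  # inverted-dropout scaling
        preds.append(dropped @ weights)
    return np.stack(preds)  # shape: (n_passes, n_samples, action_dim)


def uncertainty_weighted_error(preds, targets, eps=1e-12):
    """Weight each sample's error by its normalized predictive variance.

    Assumed form: sum_i w_i * e_i, where e_i is the mean absolute error of
    the averaged prediction and w_i is proportional to the variance across
    passes. Falls back to uniform weights when all variances are zero.
    """
    mean_pred = preds.mean(axis=0)                          # (n_samples, action_dim)
    per_sample_error = np.abs(mean_pred - targets).mean(axis=-1)
    uncertainty = preds.var(axis=0).mean(axis=-1)           # (n_samples,)
    total = uncertainty.sum()
    if total < eps:
        sample_weights = np.full_like(uncertainty, 1.0 / uncertainty.size)
    else:
        sample_weights = uncertainty / total
    return float((sample_weights * per_sample_error).sum())


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    features = rng.normal(size=(16, 8))       # placeholder inputs, not real data
    policy_weights = rng.normal(size=(8, 2))  # placeholder toy policy
    targets = features @ policy_weights       # placeholder ground-truth actions
    preds = mc_dropout_predictions(features, policy_weights, 0.2, 30, rng)
    print(uncertainty_weighted_error(preds, targets))
```

Under this assumed weighting, samples on which the stochastic passes disagree contribute more to the score, which matches the stated intent of emphasizing events likely to cause errors in closed-loop settings.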

Paper Structure

This paper contains 11 sections, 6 equations, 5 figures, 1 table.

Figures (5)

  • Figure 1: Closing the Gap Between Offline and Online Evaluation for Autonomous Driving. We revisit the relationship between offline and online evaluation of vision-based autonomous driving policies. Our key insight is that model uncertainty-weighted errors (shown on the left) can be used to estimate errors in online settings (shown on the right), such as collisions and traffic infractions. We then devise a simple and scalable metric that can be applied over offline data without requiring high-quality perception information or various surrounding agent prediction models. We validate the effectiveness of the metrics in both simulated and real-world settings across diverse model sampling strategies, types of policy failures, and platforms.
  • Figure 2: Real-World Vehicle Platform (Left) and Bird's Eye View of an Example Evaluation Route (Right). The vehicle is a hacked off-the-shelf RC car (Traxxas XMaxx). It is equipped with a Jetson Orin and a camera, and is controlled by a joystick controller for data collection and evaluation.
  • Figure 3: Driving Score Correlation Analysis in Simulation. The plot shows updated correlations for offline metrics, including TRE, QCE, and PDM, the most successful metrics reported by prior research [Codevilla2018ECCV, dauner2024navsim], as well as widely employed metrics such as waypoint FDE (Final waypoint Displacement Error). Given our updated evaluation setup with complex traffic scenarios and diverse models, correlations are low overall, except for the introduced uncertainty-weighted (UW) version. Each disc shows one model sampled from a certain epoch, backbone, input (with or without LiDAR), and test-time dropout rate, which is used to obtain coverage of models with varying performance (see the setup subsection for further explanation). The radius of each marker is proportional to the percentage of test-time dropout used (this sampling is independent of our proposed metric).
  • Figure 4: Correlation Analysis in Simulation. Best overall performance is shown by UWE (the proposed uncertainty-weighted error).
  • Figure 5: Correlation Analysis in the Real World. Results depict targeted-scenario evaluation (left) in the real world (short routes over a traffic light or pedestrian scenario) and naturalistic longer routes (right). We find trends consistent with the simulation-based results: in targeted scenarios, UW action correlates well with all online metrics when compared to TRE, QCE, and the baseline steer MAE and action MAE. Similarly, in longer routes, UW action shows correlation competitive with the baseline steer and action MAE, while outperforming TRE and QCE.
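
The correlation analyses in the figures above compare one offline metric value per model against one online score per model. A minimal sketch of that comparison is given below, assuming the Pearson coefficient (the text here does not state which coefficient is used). The function name `pearson_correlation` and the numbers in the demo are hypothetical placeholders, not results from the paper.

```python
import numpy as np


def pearson_correlation(offline_values, online_values):
    """Pearson correlation between per-model offline and online scores.

    Returns NaN when either series is constant, since the coefficient is
    undefined in that case.
    """
    x = np.asarray(offline_values, dtype=float)
    y = np.asarray(online_values, dtype=float)
    if x.shape != y.shape or x.ndim != 1:
        raise ValueError("inputs must be 1-D arrays of equal length")
    x_centered = x - x.mean()
    y_centered = y - y.mean()
    denom = np.sqrt((x_centered ** 2).sum() * (y_centered ** 2).sum())
    if denom == 0.0:
        return float("nan")
    return float((x_centered * y_centered).sum() / denom)


if __name__ == "__main__":
    # Placeholder values for five hypothetical models (one entry per model).
    offline_error = [0.10, 0.25, 0.40, 0.55, 0.70]
    online_driving_score = [82.0, 75.0, 60.0, 48.0, 30.0]
    print(pearson_correlation(offline_error, online_driving_score))
```

For an error-style offline metric and a higher-is-better online score such as Driving Score, a useful metric would be expected to show a strongly negative coefficient, so magnitudes rather than signs are what matter when comparing metrics.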