On Offline Evaluation of Vision-based Driving Models

Felipe Codevilla, Antonio M. López, Vladlen Koltun, Alexey Dosovitskiy

TL;DR

The paper examines how to evaluate vision-based driving models offline and how offline metrics relate to online driving quality. It shows that offline prediction error does not reliably predict driving performance: two models with identical prediction error can drive very differently. The authors show that the correlation between offline metrics and driving quality improves markedly with an appropriate validation dataset and suitable offline metrics, and they study models trained under a wide range of configurations (multi-camera input, noise injection, network depth, and regularization). Overall, the work highlights the practical value and pitfalls of offline evaluation for autonomous driving and emphasizes dataset and metric selection as the way to better approximate actual driving performance.

Abstract

Autonomous driving models should ideally be evaluated by deploying them on a fleet of physical vehicles in the real world. Unfortunately, this approach is not practical for the vast majority of researchers. An attractive alternative is to evaluate models offline, on a pre-collected validation dataset with ground truth annotation. In this paper, we investigate the relation between various online and offline metrics for evaluation of autonomous driving models. We find that offline prediction error is not necessarily correlated with driving quality, and two models with identical prediction error can differ dramatically in their driving performance. We show that the correlation of offline evaluation with driving quality can be significantly improved by selecting an appropriate validation dataset and suitable offline metrics. The supplementary video can be viewed at https://www.youtube.com/watch?v=P8K8Z-iF0cY
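
As a rough illustration of the kind of analysis the abstract describes, the sketch below computes a Pearson correlation between an offline metric and an online metric across a set of trained models. It is a minimal sketch, not the authors' evaluation code: the function name `pearson` and the variable names are assumptions, and the numbers are hypothetical placeholders rather than results from the paper.

```python
import math


def pearson(xs, ys):
    """Sample Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical placeholder values, NOT taken from the paper.
# One entry per trained model: offline steering error and online success rate.
offline_error = [0.10, 0.10, 0.12, 0.15, 0.20]
success_rate = [0.90, 0.40, 0.70, 0.65, 0.30]

# The first two models share the same offline error but differ in success
# rate, which is the situation the abstract warns about.
r = pearson(offline_error, success_rate)
print(f"correlation between offline error and success rate: {r:.2f}")
```

A coefficient near -1 would mean that lower offline error goes with higher success rate; a value near 0 would mean the offline metric says little about driving quality on that validation set.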

Paper Structure

This paper contains 7 sections, 7 figures, 2 tables.

Figures (7)

  • Figure 1: Scatter plots of goal-directed navigation success rate vs steering absolute error when evaluated on data from different distributions. Town 1 (training conditions), best 50% of the models.
  • Figure 2: Scatter plots of goal-directed navigation success rate vs different offline metrics. Town 1 (training conditions), best 50% of the models.
  • Figure 3: Scatter plots of online driving quality metrics versus each other. The metrics are: success rate, average fraction of distance to the goal covered (average completion), and average distance (in km) driven between two infractions. Town 1 (training conditions), all models.
  • Figure 4: Scatter plots of goal-directed navigation success rate vs steering absolute error when evaluated on data from different distributions. Town 1 (training conditions), all models.
  • Figure 5: Scatter plots of goal-directed navigation success rate vs different offline metrics. Town 1 (training conditions), all models.
  • ...and 2 more figures