WOD-E2E: Waymo Open Dataset for End-to-End Driving in Challenging Long-tail Scenarios

Runsheng Xu, Hubert Lin, Wonseok Jeon, Hao Feng, Yuliang Zou, Liting Sun, John Gorman, Ekaterina Tolstaya, Sarah Tang, Brandyn White, Ben Sapp, Mingxing Tan, Jyh-Jing Hwang, Dragomir Anguelov

TL;DR

The paper introduces WOD-E2E, a long-tail focused end-to-end driving benchmark with 4,021 real-world segments (~12 hours) and eight-camera 360° coverage, designed to stress-test E2E systems in rare, safety-critical scenarios. It pairs this dataset with the Rater Feedback Score (RFS), a human-aligned open-loop metric that evaluates predicted trajectories against multiple expert-rated references, addressing limitations of ADE and PDMS in multimodal, long-tail contexts. The authors demonstrate that traditional ADE metrics do not reliably reflect safety performance, and show that MLLM-based and RL-enhanced approaches can improve RFS, especially when rewards align with human preferences. They also present a rigorous long-tail mining and labeling pipeline, including critical moment selection and trajectory scoring, establishing a realistic framework for evaluating and advancing robust, generalizable E2E autonomous driving. Overall, WOD-E2E provides a challenging benchmark and a practical evaluation paradigm that can drive progress in safe, end-to-end driving research and inform future simulators and challenges.

Abstract

Vision-based end-to-end (E2E) driving has garnered significant interest in the research community due to its scalability and synergy with multimodal large language models (MLLMs). However, current E2E driving benchmarks primarily feature nominal scenarios, failing to adequately test the true potential of these systems. Furthermore, existing open-loop evaluation metrics often fall short in capturing the multi-modal nature of driving or effectively evaluating performance in long-tail scenarios. To address these gaps, we introduce the Waymo Open Dataset for End-to-End Driving (WOD-E2E). WOD-E2E contains 4,021 driving segments (approximately 12 hours), specifically curated for challenging long-tail scenarios that are rare in daily life, occurring with a frequency of less than 0.03%. Concretely, each segment in WOD-E2E includes high-level routing information, ego states, and 360-degree camera views from 8 surrounding cameras. To evaluate E2E driving performance in these long-tail situations, we propose a novel open-loop evaluation metric: Rater Feedback Score (RFS). Unlike conventional metrics that measure the distance between predicted waypoints and the logs, RFS measures how closely the predicted trajectory matches rater-annotated trajectory preference labels. We have released rater preference labels for all WOD-E2E validation set segments, while the held-out test set labels have been used for the 2025 WOD-E2E Challenge. Through our work, we aim to foster state-of-the-art research into generalizable, robust, and safe end-to-end autonomous driving agents capable of handling complex real-world situations.
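To make the contrast between log-distance metrics and preference-based scoring concrete, the following is a minimal sketch and not the paper's RFS formula. The `ade` function is the standard average displacement error. `toy_preference_score`, its `tolerance_m` parameter, the rating values, and the example trajectories are all hypothetical, introduced only to illustrate why a prediction that matches a well-rated alternative can still look poor under a distance-to-log metric.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]          # (x, y) in meters
Trajectory = List[Point]             # waypoints at matched timestamps


def ade(pred: Trajectory, ref: Trajectory) -> float:
    """Average displacement error: mean Euclidean distance between matched waypoints."""
    if not pred or len(pred) != len(ref):
        raise ValueError("trajectories must be non-empty and of equal length")
    return sum(math.dist(p, r) for p, r in zip(pred, ref)) / len(pred)


def toy_preference_score(
    pred: Trajectory,
    rated_refs: List[Tuple[Trajectory, float]],
    tolerance_m: float = 1.0,   # hypothetical matching tolerance
    floor: float = 0.0,         # hypothetical score when nothing matches
) -> float:
    """Hypothetical stand-in for a rater-based score (NOT the paper's RFS).

    Returns the rating of the closest rated reference if the prediction
    lies within `tolerance_m` of it on average, otherwise `floor`.
    """
    best_ref, best_rating = min(rated_refs, key=lambda item: ade(pred, item[0]))
    return best_rating if ade(pred, best_ref) <= tolerance_m else floor


# Illustrative data: the logged drive goes straight, while raters also
# accept (and here rate slightly higher) a gentle nudge to the left.
logged = [(1.0, 0.0), (2.0, 0.0), (3.0, 0.0), (4.0, 0.0)]
nudge_left = [(1.0, 0.5), (2.0, 1.0), (3.0, 1.5), (4.0, 2.0)]
rated_refs = [(logged, 8.0), (nudge_left, 9.0)]   # ratings are made up

prediction = nudge_left
print("ADE vs. log:", ade(prediction, logged))
print("Toy preference score:", toy_preference_score(prediction, rated_refs))
```

In this toy setup the prediction is penalized by ADE for deviating from the log even though it coincides with a reference that raters consider acceptable, which is the kind of mismatch the abstract attributes to conventional open-loop metrics.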

Paper Structure

This paper contains 34 sections, 3 equations, 10 figures, 2 tables.

Figures (10)

  • Figure 1: Long-tail scenario examples from the Waymo Open Dataset for End-to-End Driving (WOD-E2E). Unlike existing datasets commonly used for E2E driving benchmarking, WOD-E2E has a more explicit focus on long-tail scenarios. Our analysis in the long-tail mining section shows that WOD-E2E captures long-tail scenarios that occur with a frequency of less than 0.03% in daily driving.
  • Figure 2: High-level routing input. Ground-truth vehicle trajectories over future 5s are shown. Each trajectory is colored red/black/blue corresponding to left/straight/right routing input, derived from 10s futures. Units are in meters.
  • Figure 3: Left: Rarity comparison of driving datasets. This figure shows the average rarity score for the top percentage of data in each dataset, highlighting the distribution of rare events in WOD-E2E. Right: Proportion of mined long-tail scenarios (0.03%) from the total driving corpus (6.4 million miles).
  • Figure 4: Comprehensive data distribution analysis. This figure illustrates the key characteristics of the WOD-E2E dataset across three critical dimensions. Top Left: Distribution of service areas by city. Bottom Left: Distribution of scenario clusters and their breakdowns by road type. Right: Distribution of driving behaviors.
  • Figure 5: An illustration of how a critical frame is selected. The human raters first scan through the video for high-level understanding, and then select the critical frame, which is the earliest moment when a critical event is visually apparent in the camera images. Finally, the raters also document the rationales for the critical frame selection.
  • ...and 5 more figures
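The Figure 2 caption says that each trajectory receives a left/straight/right routing input derived from its 10s future, but the derivation rule is not given here. The sketch below shows one plausible way to do it, under stated assumptions: an ego-centric frame with x forward and y to the left, a classification based on the final lateral offset, and a hypothetical `lateral_threshold_m`. The function name and threshold are illustrative and are not taken from the dataset tooling; only the color mapping comes from the caption.

```python
from typing import List, Tuple

# Color convention stated in the Figure 2 caption.
ROUTING_COLORS = {"left": "red", "straight": "black", "right": "blue"}


def routing_command(
    future_xy: List[Tuple[float, float]],
    lateral_threshold_m: float = 5.0,   # hypothetical threshold, not from the paper
) -> str:
    """Classify a future trajectory as 'left', 'straight', or 'right'.

    Assumes waypoints are (x, y) in meters in the ego frame at the current
    time, with x pointing forward and y pointing left (an assumption).
    The decision uses only the lateral offset of the last waypoint.
    """
    if not future_xy:
        raise ValueError("future trajectory must contain at least one waypoint")
    _, lateral = future_xy[-1]
    if lateral > lateral_threshold_m:
        return "left"
    if lateral < -lateral_threshold_m:
        return "right"
    return "straight"


# Illustrative 10s futures sampled coarsely (values are made up).
going_straight = [(10.0, 0.1), (20.0, 0.3), (30.0, 0.2)]
turning_left = [(8.0, 1.0), (14.0, 6.0), (16.0, 14.0)]

for name, traj in [("going_straight", going_straight), ("turning_left", turning_left)]:
    command = routing_command(traj)
    print(name, "->", command, "(plotted in", ROUTING_COLORS[command] + ")")
```

A rule based on the final lateral offset is simple but coarse; a heading-change criterion would be a reasonable alternative if lane changes should not count as turns.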