WOD-E2E: Waymo Open Dataset for End-to-End Driving in Challenging Long-tail Scenarios
Runsheng Xu, Hubert Lin, Wonseok Jeon, Hao Feng, Yuliang Zou, Liting Sun, John Gorman, Ekaterina Tolstaya, Sarah Tang, Brandyn White, Ben Sapp, Mingxing Tan, Jyh-Jing Hwang, Dragomir Anguelov
TL;DR
The paper introduces WOD-E2E, a long-tail-focused end-to-end driving benchmark with 4,021 real-world segments (~12 hours) and eight-camera 360° coverage, designed to stress-test E2E systems in rare, safety-critical scenarios. It pairs this dataset with the Rater Feedback Score (RFS), a human-aligned open-loop metric that evaluates predicted trajectories against multiple expert-rated references, addressing limitations of ADE and PDMS in multimodal, long-tail contexts. The authors demonstrate that traditional ADE metrics do not reliably reflect safety performance, and show that MLLM-based and RL-enhanced approaches can improve RFS, especially when rewards align with human preferences. They also present a rigorous long-tail mining and labeling pipeline, including critical-moment selection and trajectory scoring, establishing a realistic framework for evaluating and advancing robust, generalizable E2E autonomous driving. Overall, WOD-E2E provides a challenging benchmark and a practical evaluation paradigm that can drive progress in safe end-to-end driving research and inform future simulators and challenges.
Abstract
Vision-based end-to-end (E2E) driving has garnered significant interest in the research community due to its scalability and synergy with multimodal large language models (MLLMs). However, current E2E driving benchmarks primarily feature nominal scenarios, failing to adequately test the true potential of these systems. Furthermore, existing open-loop evaluation metrics often fall short in capturing the multi-modal nature of driving or effectively evaluating performance in long-tail scenarios. To address these gaps, we introduce the Waymo Open Dataset for End-to-End Driving (WOD-E2E). WOD-E2E contains 4,021 driving segments (approximately 12 hours), specifically curated for challenging long-tail scenarios that are rare in daily life, occurring with a frequency of less than 0.03%. Concretely, each segment in WOD-E2E includes high-level routing information, ego states, and 360-degree camera views from 8 surrounding cameras. To evaluate E2E driving performance in these long-tail situations, we propose a novel open-loop evaluation metric: Rater Feedback Score (RFS). Unlike conventional metrics that measure the distance between predicted waypoints and the logs, RFS measures how closely the predicted trajectory matches rater-annotated trajectory preference labels. We have released rater preference labels for all WOD-E2E validation set segments, while the held-out test set labels have been used for the 2025 WOD-E2E Challenge. Through our work, we aim to foster state-of-the-art research into generalizable, robust, and safe end-to-end autonomous driving agents capable of handling complex real-world situations.
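
For readers unfamiliar with the conventional distance-based metrics mentioned above, the following is a minimal sketch of average displacement error (ADE), assuming trajectories are given as equal-length lists of 2D waypoints. The function name, the waypoint values, and the coordinate convention are hypothetical and chosen only for illustration; this is not the dataset's evaluation code, and it does not implement RFS, whose scoring rules are not specified in this abstract.

```python
import math


def average_displacement_error(predicted, logged):
    """Mean Euclidean distance between matching waypoints of two trajectories."""
    if len(predicted) != len(logged) or not predicted:
        raise ValueError("trajectories must be non-empty and of equal length")
    total = sum(math.dist(p, q) for p, q in zip(predicted, logged))
    return total / len(predicted)


# Hypothetical 3-waypoint trajectories in metres (x forward, y left).
logged = [(1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]       # what the logged driver did
nudge_left = [(1.0, 0.0), (2.0, 1.0), (3.0, 2.0)]   # a different manoeuvre

# Per-waypoint distances are 0, 1 and 2 metres, so the mean is 1.0.
ade = average_displacement_error(nudge_left, logged)
print(f"ADE = {ade:.2f} m")
```

Because the score depends only on distance to the single logged trajectory, an alternative manoeuvre is penalized purely for differing from the log, whether or not it is safe. This is the limitation the abstract attributes to such metrics in multi-modal scenarios.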
