LongTail Driving Scenarios with Reasoning Traces: The KITScenes LongTail Dataset

Royden Wagner, Omer Sahin Tas, Jaime Villa, Felix Hauser, Yinzhe Shen, Marlon Steiner, Dominik Strutz, Carlos Fernandez, Christian Kinzig, Guillermo S. Guitierrez-Cabello, Hendrik Königshof, Fabian Immel, Richard Schwarzkopf, Nils Alexander Rack, Kevin Rösch, Kaiwen Wang, Jan-Hendrik Pauls, Martin Lauer, Igor Gilitschenski, Holger Caesar, Christoph Stiller

Abstract

In real-world domains such as self-driving, generalization to rare scenarios remains a fundamental challenge. To address this, we introduce a new dataset designed for end-to-end driving that focuses on long-tail driving events. We provide multi-view video data, trajectories, high-level instructions, and detailed reasoning traces, facilitating in-context learning and few-shot generalization. The resulting benchmark for multimodal models, such as VLMs and VLAs, goes beyond safety and comfort metrics by evaluating instruction following and semantic coherence between model outputs. The multilingual reasoning traces in English, Spanish, and Chinese are from domain experts with diverse cultural backgrounds. Thus, our dataset is a unique resource for studying how different forms of reasoning affect driving competence. Our dataset is available at: https://hf.co/datasets/kit-mrt/kitscenes-longtail

Paper Structure

This paper contains 26 sections, 5 equations, 6 figures, 7 tables.

Figures (6)

  • Figure 1: Left: Strengths and weaknesses of datasets used to benchmark end-to-end driving: nuScenes, Waymo E2E, CoVLA, ours. Middle: A challenging long-tail scenario from our dataset. Right: The start of the expert reasoning trace for this scenario.
  • Figure 2: Distribution of scenario types. Numbers are percentages.
  • Figure 3: Multi-view videos with frame-wise stitching. Our dataset contains multi-view videos covering a 360° FoV with partial overlap. Our stitching method creates 360° views with overlapping areas in the rear view (see the left and right borders in (g)). We show an example from our specifically selected scenarios, in which the vehicle drives in the oncoming lane to bypass a sit-in protest by climate activists.
  • Figure 4: Relationship between MMS and $L_2$ vs. DrivingScore (DS), with linear fits and Pearson $r$ values ($0.59$ and $-0.45$).
  • Figure 5: Qualitative results. (a) to (c): We show qualitative results of turning left and right at intersections (during heavy rain) and a lane change maneuver. The blue trajectories are expert trajectories, the orange trajectories are from our wrong speed category (too low in (a) and (c), too high in (b)), and the green trajectories are from our neglect instruction category. In addition, we show the predictions of Qwen3-VL in gray. We show representative trajectories, which are scored with 3.5 points since they are not matched. (d) to (f): Samples where we include trajectories from our crash category in purple.
  • ...and 1 more figure