doScenes: An Autonomous Driving Dataset with Natural Language Instruction for Human Interaction and Vision-Language Navigation

Parthib Roy, Srinivasa Perisetla, Shashank Shriram, Harsha Krishnaswamy, Aryan Keskar, Ross Greer

TL;DR

The paper addresses the gap between natural-language human instructions and autonomous driving actions by introducing doScenes, a real-world dataset that pairs nuScenes clips with short-term driving directives and referentiality tags. It retroactively annotates 1,000 scenes using a taxi-test heuristic, enabling instruction-grounded learning for vision-language navigation and action-conditioned planning. The contributions are a publicly available dataset that links imperative language to object-referenced driving maneuvers, an analysis of instruction referentiality, and guidance for evaluating instruction-conditioned motion planning. The goal is safer human-vehicle collaboration: models trained on doScenes can learn to interpret and act on natural-language commands tied to dynamic and static scene elements in real-world driving data.

Abstract

Human-interactive robotic systems, particularly autonomous vehicles (AVs), must effectively integrate human instructions into their motion planning. This paper introduces doScenes, a novel dataset designed to facilitate research on human-vehicle instruction interactions, focusing on short-term directives that directly influence vehicle motion. By annotating multimodal sensor data with natural language instructions and referentiality tags, doScenes bridges the gap between instruction and driving response, enabling context-aware and adaptive planning. Unlike existing datasets that focus on ranking or scene-level reasoning, doScenes emphasizes actionable directives tied to static and dynamic scene objects. This framework addresses limitations in prior research, such as reliance on simulated data or predefined action sets, by supporting nuanced and flexible responses in real-world scenarios. This work lays the foundation for developing learning strategies that seamlessly integrate human instructions into autonomous systems, advancing safe and effective human-vehicle collaboration for vision-language navigation. We make our data publicly available at https://www.github.com/rossgreer/doScenes

Paper Structure

This paper contains 8 sections, 2 figures, 2 tables.

Figures (2)

  • Figure 1: Typical nuScenes data includes 3D bounding box annotations, LiDAR point clouds, and driving area map feature layers. In the doScenes dataset, we augment each clip of temporal data with an instruction and a tag to indicate the instruction's referentiality.
  • Figure 2: Histogram of the number of instruction annotations per scene; most doScenes scenes have only one or two annotations. A scene with more annotations reflects an annotator writing several alternative instructions, each of which could cause the same scene playout.
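
To make the annotation structure described in the figure captions concrete, here is a minimal Python sketch of one instruction record and of the per-scene count behind a histogram like Figure 2. Everything in it is an assumption for illustration: the class name, the field names, the tag values "static" and "dynamic", and the example instructions are invented here and are not taken from the released doScenes files, whose actual schema may differ.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class InstructionAnnotation:
    """Hypothetical record layout; not the released doScenes schema."""
    scene_name: str      # assumed scene identifier, e.g. "scene-0001"
    instruction: str     # short-term natural-language directive
    referentiality: str  # assumed tag values: "static" or "dynamic"


def annotations_per_scene(annotations):
    """Map each scene name to its number of instruction annotations."""
    return Counter(a.scene_name for a in annotations)


def histogram(per_scene):
    """Map an annotation count to the number of scenes with that count."""
    return dict(sorted(Counter(per_scene.values()).items()))


# Invented example rows, only to exercise the functions above.
EXAMPLE = [
    InstructionAnnotation("scene-0001", "Stop behind the white van.", "dynamic"),
    InstructionAnnotation("scene-0001", "Wait for the van to move.", "dynamic"),
    InstructionAnnotation("scene-0002", "Turn right at the stop sign.", "static"),
]

if __name__ == "__main__":
    print(histogram(annotations_per_scene(EXAMPLE)))
```

In this toy input, one scene carries two alternative instructions and one scene carries a single instruction, so the histogram has one scene in each bin. Keeping the scene identifier on every record is one simple way to let several instructions map to the same clip.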