VLA-R: Vision-Language Action Retrieval toward Open-World End-to-End Autonomous Driving

Hyunki Seong, Seongwoo Moon, Hojin Ahn, Jehun Kang, David Hyunchul Shim

TL;DR

VLA-R tackles open-world end-to-end autonomous driving by integrating frozen vision–language perception with a vision–language–action retrieval framework. It introduces OW-QFormer to fuse multi-source open-world cues and an Action Transformer to encode tokenized trajectories, trained with a vision–action contrastive objective to align perception with executable motions. The approach demonstrates strong generalization on a real mobile robot in unstructured outdoor environments, including unseen terrain and novel objects, while enabling plug-and-play adaptation to different motion vocabularies without retraining. This work highlights the potential of language-grounded perception to drive interpretable, scalable, and generalizable end-to-end autonomous systems beyond closed-world assumptions.

Abstract

Exploring open-world situations in an end-to-end manner is a promising yet challenging task due to the need for strong generalization capabilities. In particular, end-to-end autonomous driving in unstructured outdoor environments often encounters conditions that were unfamiliar during training. In this work, we present Vision-Language Action Retrieval (VLA-R), an open-world end-to-end autonomous driving (OW-E2EAD) framework that integrates open-world perception with a novel vision-action retrieval paradigm. We leverage a frozen vision-language model for open-world detection and segmentation to obtain multi-scale, prompt-guided, and interpretable perception features without domain-specific tuning. A Q-Former bottleneck aggregates fine-grained visual representations with language-aligned visual features, bridging perception and action domains. To learn transferable driving behaviors, we introduce a vision-action contrastive learning scheme that aligns vision-language and action embeddings for effective open-world reasoning and action retrieval. Our experiments on a real-world robotic platform demonstrate strong generalization and exploratory performance in unstructured, unseen environments, even with limited data. Demo videos are provided in the supplementary material.
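The abstract does not spell out the vision-action contrastive objective or the retrieval step, so the following is only a minimal sketch under stated assumptions: a CLIP-style symmetric InfoNCE loss over cosine similarities, and retrieval as a nearest-neighbor lookup in a fixed action vocabulary. The function names, the temperature value, and the use of plain vectors in place of OW-QFormer and Action Transformer outputs are all hypothetical and are not taken from the paper.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def cosine_similarity_matrix(vision, actions):
    """Return sim[i][j] = cosine similarity of vision[i] and actions[j]."""
    v = [l2_normalize(row) for row in vision]
    a = [l2_normalize(row) for row in actions]
    return [[sum(x * y for x, y in zip(vi, aj)) for aj in a] for vi in v]


def _diagonal_cross_entropy(logits):
    """Mean cross-entropy where the correct class of row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for a numerically stable log-sum-exp
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[i]
    return total / len(logits)


def symmetric_info_nce(vision, actions, temperature=0.07):
    """Assumed loss form: matched (vision[i], actions[i]) pairs are positives,
    every other pairing in the batch is a negative, in both directions."""
    sim = cosine_similarity_matrix(vision, actions)
    logits = [[s / temperature for s in row] for row in sim]
    logits_t = [list(col) for col in zip(*logits)]  # action-to-vision view
    return 0.5 * (_diagonal_cross_entropy(logits)
                  + _diagonal_cross_entropy(logits_t))


def retrieve_action(vision_embedding, action_vocabulary):
    """Return the index of the vocabulary entry most similar to the scene."""
    sims = cosine_similarity_matrix([vision_embedding], action_vocabulary)[0]
    return max(range(len(sims)), key=sims.__getitem__)


if __name__ == "__main__":
    # Toy 2-D embeddings; real embeddings would come from trained encoders.
    vision_batch = [[1.0, 0.0], [0.0, 1.0]]
    action_batch = [[1.0, 0.0], [0.0, 1.0]]
    print("loss:", symmetric_info_nce(vision_batch, action_batch))

    vocabulary = [[0.0, 1.0], [1.0, 0.0], [-1.0, 0.0]]
    print("retrieved token index:", retrieve_action([0.9, 0.1], vocabulary))
```

Because `retrieve_action` only compares the scene embedding against whatever list of action embeddings it is given, swapping in a different vocabulary requires no change to the vision side. This is one plausible reading of the plug-and-play adaptation to different motion vocabularies described in the TL;DR.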

Paper Structure

This paper contains 21 sections, 9 equations, 9 figures, 3 tables.

Figures (9)

  • Figure 1: Comparison between action generation paradigms.
  • Figure 2: Overview of our Vision-Language Action Retrieval (VLA-R). Our key contribution is leveraging open-world perception features together with a novel Action Retrieval mechanism, toward generalizable autonomous driving.
  • Figure 3: We utilize a frozen open-world perception network as the backbone, enabling OW-E2EAD in real-world environments. In addition to visual embeddings, we use text-aligned visual features and bounding-box distributions as inputs.
  • Figure 4: Our three-dimensional action token vocabulary.
  • Figure 5: Scenarios and robot platform used in our experiments.
  • ...and 4 more figures