VLA-R: Vision-Language Action Retrieval toward Open-World End-to-End Autonomous Driving
Hyunki Seong, Seongwoo Moon, Hojin Ahn, Jehun Kang, David Hyunchul Shim
TL;DR
VLA-R tackles open-world end-to-end autonomous driving by integrating frozen vision–language perception with a vision–language–action retrieval framework. It introduces OW-QFormer to fuse multi-source open-world cues and an Action Transformer to encode tokenized trajectories, trained with a vision–action contrastive objective to align perception with executable motions. The approach demonstrates strong generalization on a real mobile robot in unstructured outdoor environments, including unseen terrain and novel objects, while enabling plug-and-play adaptation to different motion vocabularies without retraining. This work highlights the potential of language-grounded perception to drive interpretable, scalable, and generalizable end-to-end autonomous systems beyond closed-world assumptions.
Abstract
Exploring open-world situations in an end-to-end manner is a promising yet challenging task because it demands strong generalization. In particular, end-to-end autonomous driving in unstructured outdoor environments often encounters conditions that were not seen during training. In this work, we present Vision-Language Action Retrieval (VLA-R), an open-world end-to-end autonomous driving (OW-E2EAD) framework that integrates open-world perception with a novel vision-action retrieval paradigm. We leverage a frozen vision-language model for open-world detection and segmentation to obtain multi-scale, prompt-guided, and interpretable perception features without domain-specific tuning. A Q-Former bottleneck aggregates fine-grained visual representations with language-aligned visual features, bridging the perception and action domains. To learn transferable driving behaviors, we introduce a vision-action contrastive learning scheme that aligns vision-language and action embeddings for effective open-world reasoning and action retrieval. Our experiments on a real-world robotic platform demonstrate strong generalization and exploratory performance in unstructured, unseen environments, even with limited data. Demo videos are provided in the supplementary material.
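
The vision-action contrastive alignment and action retrieval described above can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: it assumes a symmetric InfoNCE-style loss over paired embeddings and cosine-similarity retrieval over a fixed motion vocabulary, neither of which the abstract specifies. The function names (`contrastive_loss`, `retrieve_action`), the temperature of 0.07, and the embedding sizes are all hypothetical choices for illustration.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def contrastive_loss(vision_emb, action_emb, temperature=0.07):
    """Symmetric InfoNCE-style loss (assumed form, not taken from the paper).

    Row i of vision_emb and row i of action_emb form a positive pair;
    every other row in the batch serves as a negative.
    """
    v = l2_normalize(vision_emb)
    a = l2_normalize(action_emb)
    logits = v @ a.T / temperature              # (batch, batch) similarity matrix
    idx = np.arange(len(v))
    loss_v2a = -log_softmax(logits, axis=1)[idx, idx].mean()  # vision -> action
    loss_a2v = -log_softmax(logits, axis=0)[idx, idx].mean()  # action -> vision
    return 0.5 * (loss_v2a + loss_a2v)


def retrieve_action(vision_emb, vocab_emb):
    """Return the index of the vocabulary entry most similar to one vision embedding."""
    scores = l2_normalize(vocab_emb) @ l2_normalize(vision_emb)
    return int(np.argmax(scores)), scores


# Toy data standing in for encoder outputs (hypothetical sizes).
rng = np.random.default_rng(0)
vocab_emb = rng.normal(size=(16, 32))           # 16 candidate trajectories, 32-dim
query = vocab_emb[5] + 0.01 * rng.normal(size=32)

best_idx, _ = retrieve_action(query, vocab_emb)
aligned_loss = contrastive_loss(vocab_emb, vocab_emb)
mismatched_loss = contrastive_loss(vocab_emb, vocab_emb[::-1])
```

Under these assumptions, retrieval only needs the embeddings of whatever candidate trajectories are supplied at inference time, which is one way to read the plug-and-play claim about swapping motion vocabularies without retraining: a different `vocab_emb` matrix can be passed to `retrieve_action` with no change to the vision side.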
