10 Open Challenges Steering the Future of Vision-Language-Action Models
Soujanya Poria, Navonil Majumder, Chia-Yu Hung, Amir Ali Bagherzadeh, Chuan Li, Kenneth Kwok, Ziwei Wang, Cheston Tan, Jiajun Wu, David Hsu
TL;DR
This work surveys Vision-Language-Action (VLA) models as a path to embodied AI, articulating 10 open challenges that span sensing, reasoning, data, evaluation, generalization, efficiency, coordination, safety, agents, and human collaboration. It contrasts discrete-token and continuous-action VLA approaches, emphasizes the need for depth-aware perception, robust long-horizon reasoning, and scalable data pipelines, and highlights evaluation weaknesses stemming from limited robotic benchmarks and the sim-to-real gap. The authors advocate emerging trends such as hierarchical planning, spatially aware perception, universal action representations, and world dynamics, supported by data synthesis and post-training strategies that use world models and simulators as reward signals and safety evaluators. Collectively, the paper outlines a concrete roadmap to improve generalization, efficiency, safety, and human-robot collaboration in embodied AI, with practical implications for deployable, trustworthy VLA systems.
Abstract
Due to their ability to follow natural language instructions, vision-language-action (VLA) models are increasingly prevalent in the embodied AI arena, following the widespread success of their precursors, LLMs and VLMs. In this paper, we discuss 10 principal milestones in the ongoing development of VLA models: multimodality, reasoning, data, evaluation, cross-robot action generalization, efficiency, whole-body coordination, safety, agents, and coordination with humans. Furthermore, we discuss the emerging trends of using spatial understanding, modeling world dynamics, post-training, and data synthesis, all aiming to reach these milestones. Through these discussions, we hope to bring attention to the research avenues that may accelerate the development of VLA models toward wider acceptance.
