10 Open Challenges Steering the Future of Vision-Language-Action Models

Soujanya Poria, Navonil Majumder, Chia-Yu Hung, Amir Ali Bagherzadeh, Chuan Li, Kenneth Kwok, Ziwei Wang, Cheston Tan, Jiajun Wu, David Hsu

TL;DR

This work surveys Vision-Language-Action (VLA) models as a path to embodied AI, articulating 10 open challenges that span sensing, reasoning, data, evaluation, generalization, efficiency, coordination, safety, agents, and human collaboration. It contrasts discrete-token and continuous-action VLA approaches, emphasizes the need for depth-aware perception, robust long-horizon reasoning, and scalable data pipelines, and highlights evaluation gaps due to limited robotic benchmarks and sim-to-real gaps. The authors advocate emerging trends such as hierarchical planning, spatially aware perception, universal action representations, and world dynamics, supported by data synthesis and post-training strategies that use world models and simulators as reward signals and safety evaluators. Collectively, the paper outlines a concrete roadmap to improve generalization, efficiency, safety, and human-robot collaboration in embodied AI, with practical implications for deployable, trustworthy VLA systems.

Abstract

Due to their ability to follow natural language instructions, vision-language-action (VLA) models are increasingly prevalent in the embodied AI arena, following the widespread success of their precursors -- LLMs and VLMs. In this paper, we discuss 10 principal milestones in the ongoing development of VLA models -- multimodality, reasoning, data, evaluation, cross-robot action generalization, efficiency, whole-body coordination, safety, agents, and coordination with humans. Furthermore, we discuss the emerging trends of using spatial understanding, modeling world dynamics, post-training, and data synthesis -- all aiming to reach these milestones. Through these discussions, we hope to bring attention to the research avenues that may accelerate the development of VLA models toward wider acceptance.

Paper Structure

This paper contains 32 sections, 6 equations, 2 figures, 1 algorithm.

Figures (2)

  • Figure 1: A high-level emerging VLA framework.