AsyncVLA: An Asynchronous VLA for Fast and Robust Navigation on the Edge

Noriaki Hirose, Catherine Glossop, Dhruv Shah, Sergey Levine

TL;DR

This work proposes AsyncVLA, an asynchronous control framework that decouples semantic reasoning from reactive execution, and introduces an end-to-end finetuning protocol and a trajectory re-weighting strategy that prioritizes dynamic interactions.

Abstract

Robotic foundation models achieve strong generalization by leveraging internet-scale vision-language representations, but their massive computational cost creates a fundamental bottleneck: high inference latency. In dynamic environments, this latency breaks the control loop, rendering powerful models unsafe for real-time deployment. We propose AsyncVLA, an asynchronous control framework that decouples semantic reasoning from reactive execution. Inspired by hierarchical control, AsyncVLA runs a large foundation model on a remote workstation to provide high-level guidance, while a lightweight, onboard Edge Adapter continuously refines actions at high frequency. To bridge the domain gap between these asynchronous streams, we introduce an end-to-end finetuning protocol and a trajectory re-weighting strategy that prioritizes dynamic interactions. We evaluate our approach on real-world vision-based navigation tasks with communication delays up to 6 seconds. AsyncVLA achieves a 40% higher success rate than state-of-the-art baselines, effectively bridging the gap between the semantic intelligence of large models and the reactivity required for edge robotics.
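The decoupling described above can be illustrated with a small discrete-time simulation. This is only a minimal sketch under stated assumptions, not the authors' implementation: the 1D world, the `Guidance` structure, and the `remote_model`, `edge_adapter`, and `simulate` functions are all hypothetical stand-ins (the real Edge Adapter is a learned network, and the real base model is a VLA). The sketch shows a slow model whose output arrives several ticks late, and a fast onboard loop that acts on every tick using the latest available guidance plus fresh local sensing.

```python
from dataclasses import dataclass
from typing import List, Optional, Set, Tuple


@dataclass
class Guidance:
    """Hypothetical high-level output of the slow remote model."""
    pos_at_issue: float   # robot position when the observation was captured
    target_offset: float  # distance to the goal as seen from that position


def remote_model(position: float, goal: float) -> Guidance:
    """Stand-in for the large base model: coarse guidance toward the goal."""
    return Guidance(pos_at_issue=position, target_offset=goal - position)


def edge_adapter(guidance: Optional[Guidance], position: float,
                 obstacle_near: bool) -> float:
    """Stand-in for the onboard adapter: refine stale guidance with fresh sensing."""
    if guidance is None or obstacle_near:
        return 0.0  # wait for guidance, or yield to the obstacle
    # Compensate for the motion made since the guidance was issued.
    remaining = guidance.target_offset - (position - guidance.pos_at_issue)
    return max(-1.0, min(1.0, remaining))  # clip to the velocity limit


def simulate(ticks: int, delay_ticks: int, goal: float,
             obstacle_ticks: Set[int]) -> Tuple[List[float], List[float], int]:
    position = 0.0
    latest: Optional[Guidance] = None
    pending: Optional[Tuple[int, Guidance]] = None  # (arrival tick, guidance)
    positions: List[float] = []
    actions: List[float] = []
    remote_calls = 0

    for t in range(ticks):
        # 1) Deliver delayed guidance if it has arrived.
        if pending is not None and pending[0] <= t:
            latest, pending = pending[1], None
        # 2) Issue a new remote request only when none is in flight.
        if pending is None:
            pending = (t + delay_ticks, remote_model(position, goal))
            remote_calls += 1
        # 3) The onboard loop acts on every tick, regardless of remote latency.
        action = edge_adapter(latest, position, t in obstacle_ticks)
        position += action
        actions.append(action)
        positions.append(position)

    return positions, actions, remote_calls


if __name__ == "__main__":
    positions, actions, calls = simulate(20, 3, 5.0, {4, 5})
    print(positions[-1], calls)
```

In this toy setting the onboard loop stops while the obstacle is present, even though the guidance it holds is several ticks old, and it resumes toward the goal without waiting for a fresh remote response. The remote model is queried far less often than the onboard loop runs.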

Paper Structure

This paper contains 24 sections, 4 equations, 8 figures, 2 tables, 1 algorithm.

Figures (8)

  • Figure 1: Network architecture. We train an Edge Adapter on top of the large robotic foundation model, OmniVLA, for vision-based navigation. During inference, the Edge Adapter runs on the robot's onboard controller at maximum speed to adjust the robot's behavior to the current environment, and the base VLA runs on the workstation to provide rich visual and language understanding.
  • Figure 2: Hardware setup for our proposed system. The Edge Adapter is deployed on an NVIDIA Jetson Orin mounted on a mobile robot (Vizbot), while the base VLA runs on a remote workstation equipped with an NVIDIA RTX 4090.
  • Figure 3: Time delay between the workstation and the robot onboard controller in our experiments. We measure time delay across four different environments and visualize the results using distinct colors for each environment.
  • Figure 4: Visualization of 2D pose-conditioned navigation in the presence of a pedestrian under a workstation latency of 0.2 s and fluctuating network latency. Our AsyncVLA yields to the pedestrian and then continues along the required trajectory to reach the goal.
  • Figure 5: Visualization of language-conditioned navigation under a workstation latency of 0.2 s. Our AsyncVLA achieves strong language-following performance by leveraging the large base VLA.
  • ...and 3 more figures