Cross-Platform Scaling of Vision-Language-Action Models from Edge to Cloud GPUs

Amir Taherin, Juyi Lin, Arash Akbari, Arman Akbari, Pu Zhao, Weiwei Chen, David Kaeli, Yanzhi Wang

TL;DR

This paper studies how Vision-Language-Action (VLA) models scale across hardware platforms, from power-constrained edge devices to cloud GPUs. It evaluates five VLA architectures, including two new ones (VOTE and QwenVLA), on the NVIDIA Jetson AGX Orin edge platform and multiple datacenter GPUs using the LIBERO benchmark, measuring accuracy, latency, throughput, and peak memory. Backbone size and the vision encoder largely determine memory footprint, while chunked-decoding architectures yield the highest throughput with minimal accuracy loss. Under certain configurations the edge device can even outperform older datacenter GPUs, which challenges the assumption that cloud GPUs are universally superior for robotic inference. The results provide practical deployment guidelines and point to future work on additional architectures and quantization strategies for VLA inference in real-world robotics.

Abstract

Vision-Language-Action (VLA) models have emerged as powerful generalist policies for robotic control, yet their performance scaling across model architectures and hardware platforms, as well as their associated power budgets, remains poorly understood. This work presents an evaluation of five representative VLA models -- spanning state-of-the-art baselines and two newly proposed architectures -- targeting edge and datacenter GPU platforms. Using the LIBERO benchmark, we measure accuracy alongside system-level metrics, including latency, throughput, and peak memory usage, under varying edge power constraints and high-performance datacenter GPU configurations. Our results identify distinct scaling trends: (1) architectural choices, such as action tokenization and model backbone size, strongly influence throughput and memory footprint; (2) power-constrained edge devices exhibit non-linear performance degradation, with some configurations matching or exceeding older datacenter GPUs; and (3) high-throughput variants can be achieved without significant accuracy loss. These findings provide actionable insights when selecting and optimizing VLAs across a range of deployment constraints. Our work challenges current assumptions about the superiority of datacenter hardware for robotic inference.

Paper Structure

This paper contains 8 sections, 3 figures, and 5 tables.

Figures (3)

  • Figure 1: Peak VRAM usage for each evaluated VLA model during inference on the NVIDIA Jetson AGX Orin.
  • Figure 2: Per-chunk latency for each VLA model evaluated on the H100 datacenter GPU and Jetson AGX Orin (MAX power mode). The H100 achieves latencies roughly an order of magnitude smaller than the Orin across all models. VOTE configurations are consistently competitive on both platforms, with VOTE-MLP4 achieving the lowest latency on Orin.
  • Figure 3: Throughput (Hz) for each evaluated VLA model across (a) four datacenter GPUs and (b) Jetson AGX Orin under different power modes. Results highlight scaling trends with hardware class, power budget, and model architecture.
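
The figures report per-chunk latency and throughput in Hz. As a rough illustration of how the two quantities can relate for chunked decoding, the Python sketch below assumes that throughput is the number of actions emitted per chunk divided by the wall-clock time of one chunk. That definition, the function names, and the numbers in the usage note are assumptions made for illustration; they are not taken from the paper's measurement code or results.

```python
import time
from statistics import median
from typing import Callable


def measure_chunk_latency(predict_chunk: Callable[[], object],
                          warmup: int = 3, runs: int = 10) -> float:
    """Return the median wall-clock seconds of one predict_chunk() call.

    predict_chunk is a hypothetical zero-argument callable that runs one
    inference step. For a GPU model it is assumed to block until the
    result is ready; otherwise asynchronous execution would make the
    timing too optimistic.
    """
    if runs < 1:
        raise ValueError("runs must be at least 1")
    # Warm-up calls are executed but not timed.
    for _ in range(warmup):
        predict_chunk()
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        predict_chunk()
        samples.append(time.perf_counter() - start)
    return median(samples)


def throughput_hz(actions_per_chunk: int, chunk_latency_s: float) -> float:
    """Actions per second, assuming each call yields actions_per_chunk actions."""
    if actions_per_chunk < 1:
        raise ValueError("actions_per_chunk must be at least 1")
    if chunk_latency_s <= 0:
        raise ValueError("chunk_latency_s must be positive")
    return actions_per_chunk / chunk_latency_s
```

With made-up numbers, `throughput_hz(8, 0.25)` gives 32 actions per second, while a model that emits a single action in the same 0.25 s reaches only 4. Under this assumed definition, a chunked decoder can trade a somewhat higher per-call latency for a much higher action rate.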