NanoVLA: Routing Decoupled Vision-Language Understanding for Nano-sized Generalist Robotic Policies

Jiahong Chen, Jing Wang, Long Chen, Chuwei Cai, Jinghui Lu

TL;DR

The paper addresses the gap between state-of-the-art vision-language-action policies and resource-constrained edge hardware. It introduces NanoVLA, a lightweight VLA framework that decouples vision-language fusion, unrolls actions over time with long-short chunks, and routes computation adaptively across backbones. The method achieves significant edge efficiency—up to 52x faster inference—with far fewer parameters while maintaining or surpassing accuracy on LIBERO benchmarks and real-world tasks. Ablation studies validate the importance of decoupling, chunking, and routing for cross-task transferability and cost-performance trade-offs. This work offers a practical path toward real-time embodied AI on edge devices.

Abstract

Vision-language-action (VLA) models have significantly advanced robotic manipulation by integrating vision-language models (VLMs) and action decoders into a unified architecture. However, their deployment on resource-constrained edge devices, such as mobile robots or embedded systems (e.g., Jetson Orin Nano), remains challenging due to high computational demands, especially in real-world scenarios where power, latency, and computational resources are critical. To close this gap, we introduce Nano-scale Vision-Language Action (NanoVLA), a family of lightweight VLA architectures that achieve high performance with minimal resources. Our core innovations include: (1) vision-language decoupling that moves the conventional early fusion of vision and language inputs in the VLM to a late stage, achieving better performance while enabling caching and reducing inference overhead and latency; (2) long-short action chunking to ensure smooth, coherent multi-step planning without sacrificing real-time responsiveness; (3) dynamic routing that adaptively assigns lightweight or heavy backbones based on task complexity, further optimizing inference efficiency. Experimental results on several benchmarks, as well as real-world deployments, demonstrate that NanoVLA achieves up to 52x faster inference on edge devices compared to previous state-of-the-art VLA models, with 98% fewer parameters while maintaining or surpassing their task accuracy and generalization. Ablation studies confirm that our decoupling strategy preserves cross-task transferability, and the routing module enhances cost-performance trade-offs, enabling practical, high-precision robotic manipulation on resource-constrained hardware.
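
To make the caching benefit of late fusion concrete, the following is a minimal sketch, not the paper's implementation. It assumes toy stand-in encoders and a single parameter-free cross-attention step; the names `encode_language`, `encode_vision`, `late_fusion`, `policy_step`, and the width `D` are all hypothetical. The point it illustrates is that when the two modalities are encoded independently, the instruction embedding can be computed once and reused across control steps, while only the vision branch and the lightweight fusion run per frame.

```python
import math
from functools import lru_cache

D = 4  # toy embedding width (assumption, not from the paper)


@lru_cache(maxsize=32)
def encode_language(instruction):
    """Toy language encoder: one D-dim token per word, cached per instruction."""
    return tuple(
        tuple(((sum(map(ord, word)) * (i + 1)) % 7) / 7.0 for i in range(D))
        for word in instruction.split()
    )


def encode_vision(frame):
    """Toy vision encoder: each row of the frame becomes one D-dim token."""
    return [[float(v) for v in row[:D]] for row in frame]


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def late_fusion(vision_tokens, language_tokens):
    """Cross-attention: vision tokens query the language tokens, plus a residual."""
    fused = []
    for q in vision_tokens:
        scores = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(D)
            for k in language_tokens
        ]
        weights = softmax(scores)
        context = [
            sum(w * v[i] for w, v in zip(weights, language_tokens))
            for i in range(D)
        ]
        fused.append([qi + ci for qi, ci in zip(q, context)])
    return fused


def policy_step(frame, instruction):
    """One control step: fresh vision encoding, cached language encoding."""
    return late_fusion(encode_vision(frame), encode_language(instruction))
```

Calling `policy_step` on successive frames with the same instruction string encodes the instruction only on the first call; `encode_language.cache_info()` reports the hits and misses. With early fusion, where image and text tokens pass through the same backbone together, this reuse would not be available in the same way.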

Paper Structure

This paper contains 32 sections, 22 equations, 15 figures, 3 tables.

Figures (15)

  • Figure 1: Decoupled fusion for efficient VLA policies. Our decoupled fusion strategy (mid) delays vision-to-language fusion in parameter-constrained settings. This approach enables better performance with less overhead and latency, which informs NanoVLA, a small-scale VLA that achieves better performance across both simulation and real-world tasks with only $\sim$2% of the parameters of models like OpenVLA, as shown in the radar plot (right).
  • Figure 2: Overview of NanoVLA framework. Multi-modal inputs are processed independently and fused at a late stage via a lightweight attention layer. This design bypasses the compute-intensive early fusion in VLMs, unlocking key advantages like caching and accelerated inference.
  • Figure 3: LeRobot experiment setup. We design 10 real-world tasks covering different manipulation skills and objects, plus 2 additional unseen tasks for evaluating policy generalization.
  • Figure 4: Inference analysis on Jetson Orin Nano (left) and long-short action chunking ablations.
  • Figure 5: Ablation studies on environmental variant impact (left) and caching effectiveness (right).
  • ...and 10 more figures