NanoVLA: Routing Decoupled Vision-Language Understanding for Nano-sized Generalist Robotic Policies
Jiahong Chen, Jing Wang, Long Chen, Chuwei Cai, Jinghui Lu
TL;DR
The paper addresses the gap between state-of-the-art vision-language-action policies and resource-constrained edge hardware. It introduces NanoVLA, a lightweight VLA framework that defers vision-language fusion to a late stage, unrolls actions over time with long-short chunks, and routes computation adaptively across backbones. The method achieves up to 52x faster inference on edge devices with 98% fewer parameters, while maintaining or surpassing accuracy on LIBERO benchmarks and real-world tasks. Ablation studies validate the importance of decoupling, chunking, and routing for cross-task transferability and cost-performance trade-offs. This work offers a practical path toward real-time embodied AI on edge devices.
Abstract
Vision-language-action (VLA) models have significantly advanced robotic manipulation by integrating vision-language models (VLMs) and action decoders into a unified architecture. However, deploying them on resource-constrained edge devices, such as mobile robots or embedded systems (e.g., Jetson Orin Nano), remains challenging because of their high computational demands, especially in real-world scenarios where power, latency, and compute are tightly constrained. To close this gap, we introduce Nano-scale Vision-Language Action (NanoVLA), a family of lightweight VLA architectures that achieve high performance with minimal resources. Our core innovations are: (1) vision-language decoupling, which moves the fusion of vision and language inputs from the conventional early stage of the VLM to a late stage, improving performance while enabling caching and reducing inference overhead and latency; (2) long-short action chunking, which ensures smooth, coherent multi-step planning without sacrificing real-time responsiveness; and (3) dynamic routing, which adaptively assigns a lightweight or heavy backbone based on task complexity, further optimizing inference efficiency. Experimental results on several benchmarks, as well as real-world deployments, demonstrate that NanoVLA achieves up to 52x faster inference on edge devices than previous state-of-the-art VLA models, with 98% fewer parameters, while maintaining or surpassing their task accuracy and generalization. Ablation studies confirm that our decoupling strategy preserves cross-task transferability and that the routing module improves cost-performance trade-offs, enabling practical, high-precision robotic manipulation on resource-constrained hardware.
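To illustrate why late fusion can enable caching, here is a minimal Python sketch. It is not the NanoVLA implementation: the abstract does not say what is cached, so the sketch assumes that the instruction stays fixed across an episode and that its language features can therefore be computed once and reused for every camera frame. The class `LateFusionPolicy`, the toy encoders, and the concatenation-based fusion are all hypothetical stand-ins.

```python
from typing import Callable, Dict, List

Vector = List[float]


class LateFusionPolicy:
    """Toy late-fusion policy (hypothetical, not the paper's architecture).

    Text and image are encoded independently and only combined at the end,
    so the text features do not depend on the current frame and can be cached.
    """

    def __init__(
        self,
        encode_text: Callable[[str], Vector],
        encode_image: Callable[[Vector], Vector],
    ) -> None:
        self.encode_text = encode_text
        self.encode_image = encode_image
        self._text_cache: Dict[str, Vector] = {}
        self.text_encoder_calls = 0  # counts real (uncached) text encodings

    def _text_features(self, instruction: str) -> Vector:
        # Run the text encoder only the first time an instruction is seen.
        if instruction not in self._text_cache:
            self.text_encoder_calls += 1
            self._text_cache[instruction] = self.encode_text(instruction)
        return self._text_cache[instruction]

    def act(self, instruction: str, image: Vector) -> Vector:
        text_feat = self._text_features(instruction)  # cached after step 1
        image_feat = self.encode_image(image)         # recomputed every frame
        # Late fusion, assumed here to be plain concatenation.
        return text_feat + image_feat


def toy_text_encoder(instruction: str) -> Vector:
    # Stand-in for a language model: [character count, word count].
    return [float(len(instruction)), float(instruction.count(" ") + 1)]


def toy_image_encoder(image: Vector) -> Vector:
    # Stand-in for a vision encoder: [mean intensity, max intensity].
    return [sum(image) / len(image), max(image)]


policy = LateFusionPolicy(toy_text_encoder, toy_image_encoder)
frames = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6], [0.7, 0.8, 0.9]]
outputs = [policy.act("pick up the red block", frame) for frame in frames]
print(policy.text_encoder_calls)
```

In this toy setup the text encoder runs once for three control steps, while the image encoder runs on every frame. With early fusion, text tokens would be processed jointly with each new image, so this kind of reuse would not be available in the same form.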
