Lite VLA: Efficient Vision-Language-Action Control on CPU-Bound Edge Robots

Justin Williams, Kishor Datta Gupta, Roy George, Mrinmoy Sarkar

TL;DR

LiteVLA tackles the challenge of running vision-language-action reasoning directly on CPU-bound edge robots without relying on cloud infrastructure. It combines a compact SmolVLM backbone with LoRA fine-tuning and 4-bit NF4 quantization to enable on-device, CPU-only inference that integrates with ROS 2 for end-to-end visuomotor control on a Raspberry Pi 4. The approach demonstrates an end-to-end perception-to-action loop with a per-query latency of approximately 11 seconds (about 0.09 Hz), using a hybrid-precision quantization strategy to balance speed, memory, and output stability. This work enables scalable, privacy-preserving edge autonomy for service robotics, disaster response, and defense, and outlines a six-phase ROADMAP toward collaborative, continual, and federated edge intelligence.
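
The relation between the reported latency and query rate can be checked with a short calculation. The sketch below is illustrative only: the function names and the 0.1 m/s robot speed are hypothetical and do not come from the paper; only the 11-second latency is taken from the summary above.

```python
def query_rate_hz(latency_s: float) -> float:
    """Queries per second for a loop that completes one query every latency_s seconds."""
    if latency_s <= 0:
        raise ValueError("latency_s must be positive")
    return 1.0 / latency_s


def travel_between_decisions_m(latency_s: float, speed_m_per_s: float) -> float:
    """Distance covered at constant speed while one query is being processed."""
    return latency_s * speed_m_per_s


if __name__ == "__main__":
    latency = 11.0  # approximate per-query latency reported in the summary
    speed = 0.1     # hypothetical robot speed in m/s, chosen only for illustration

    print(f"{query_rate_hz(latency):.2f} Hz")                      # prints "0.09 Hz"
    print(f"{travel_between_decisions_m(latency, speed):.2f} m")   # prints "1.10 m"
```

Under these assumptions, a robot moving at 0.1 m/s would cover roughly 1.1 m between consecutive model decisions, which gives a rough sense of what a 0.09 Hz reasoning loop means for concurrent movement.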

Abstract

The deployment of artificial intelligence models at the edge is increasingly critical for autonomous robots operating in GPS-denied environments where local, resource-efficient reasoning is essential. This work demonstrates the feasibility of deploying small Vision-Language Models (VLMs) on mobile robots to achieve real-time scene understanding and reasoning under strict computational constraints. Unlike prior approaches that separate perception from mobility, the proposed framework enables simultaneous movement and reasoning in dynamic environments using only on-board hardware. The system integrates a compact VLM with multimodal perception to perform contextual interpretation directly on embedded hardware, eliminating reliance on cloud connectivity. Experimental validation highlights the balance between computational efficiency, task accuracy, and system responsiveness. Implementation on a mobile robot confirms one of the first successful deployments of small VLMs for concurrent reasoning and mobility at the edge. This work establishes a foundation for scalable, assured autonomy in applications such as service robotics, disaster response, and defense operations.

Paper Structure

This paper contains 18 sections, 4 equations, 3 figures, 1 table, 1 algorithm.

Figures (3)

  • Figure 1: Overview of the LiteVLA system architecture integrating on-device inference, ROS 2 control, and motor actuation.
  • Figure 2: LiteVLA architecture highlighting LoRA-adapted layers and NF4 quantization workflow for CPU deployment.
  • Figure 3: Simplified EDGE-VLA-ROADMAP pipeline showing six evolutionary phases from single-agent LiteVLA deployment to collaborative edge reasoning.