AutoNeural: Co-Designing Vision-Language Models for NPU Inference

Wei Chen, Liangmin Wu, Yunhai Hu, Zhiyuan Li, Zhiyuan Cheng, Yicheng Qian, Lingyue Zhu, Zhipeng Hu, Luoyi Liang, Qiang Tang, Zhen Liu, Han Yang

TL;DR

The paper tackles the mismatch between GPU-optimized vision–language models and edge NPUs by redesigning both the vision and language backbones for integer-only inference. It introduces AutoNeural, which pairs a MobileNetV5-style vision encoder with a hybrid Transformer–SSM language backbone and uses quantization-aware training to reduce activation and memory bottlenecks. Key results include up to 7× lower quantization error in the vision encoder, up to 14× lower end-to-end latency, 3× faster decoding, and a 4× longer context window, validated on a Qualcomm SA8295P automotive SoC. This hardware-aware co-design yields substantial practical gains for real-time cockpit AI and highlights the importance of hardware-specific architectural choices for edge multimodal systems.

Abstract

While Neural Processing Units (NPUs) offer high theoretical efficiency for edge AI, state-of-the-art Vision-Language Models (VLMs) tailored for GPUs often falter on these substrates. We attribute this hardware-model mismatch to two primary factors: the quantization brittleness of Vision Transformers (ViTs) and the I/O-bound nature of autoregressive attention mechanisms, which fail to utilize the high arithmetic throughput of NPUs. To bridge this gap, we propose AutoNeural, an NPU-native VLM architecture co-designed for integer-only inference. We replace the standard ViT encoder with a MobileNetV5-style backbone utilizing depthwise separable convolutions, which ensures bounded activation distributions for stable INT4/8/16 quantization. Complementing this, our language backbone integrates State-Space Model (SSM) principles with Transformer layers, employing efficient gated convolutions to achieve linear-time complexity. This hybrid design eliminates the heavy memory I/O overhead of Key-Value caching during generation. Our approach delivers substantial efficiency gains, reducing the quantization error of the vision encoder by up to 7x and end-to-end latency by 14x compared to conventional baselines. AutoNeural also delivers 3x decoding speed and a 4x longer context window than the baseline. We validate these improvements via a real-world automotive case study on the Qualcomm SA8295P SoC, demonstrating real-time performance for cockpit applications. Our results highlight that rethinking model topology specifically for NPU constraints is a prerequisite for robust multi-modal edge intelligence.
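
The abstract's point about bounded activation distributions can be illustrated with a toy experiment. The sketch below is not the paper's evaluation: it uses synthetic values, a simple symmetric per-tensor INT8 scheme, and hypothetical helper names (`quantize_dequantize`, `mean_squared_error`). It only shows why a single large outlier, which stretches the quantization range, inflates the rounding error for every other value.

```python
import random

def quantize_dequantize(values, bits=8):
    """Symmetric per-tensor fake quantization: snap to a signed integer grid, then rescale."""
    qmax = 2 ** (bits - 1) - 1                      # 127 for INT8
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0  # one scale shared by the whole tensor
    return [max(-qmax, min(qmax, round(v / scale))) * scale for v in values]

def mean_squared_error(reference, approximation):
    """Average squared difference between two equally long sequences."""
    return sum((x - y) ** 2 for x, y in zip(reference, approximation)) / len(reference)

rng = random.Random(0)

# Synthetic "bounded" activations, all inside [-1, 1].
bounded = [rng.uniform(-1.0, 1.0) for _ in range(4096)]

# The same activations with one synthetic outlier (assumed value, for illustration only).
with_outlier = bounded[:-1] + [60.0]

err_bounded = mean_squared_error(bounded, quantize_dequantize(bounded))
err_outlier = mean_squared_error(with_outlier, quantize_dequantize(with_outlier))

print(f"MSE, bounded activations:  {err_bounded:.3e}")
print(f"MSE, one outlier present: {err_outlier:.3e}")
```

In this toy setting the outlier forces a much coarser step size, so the error on the well-behaved values grows by orders of magnitude. Real deployments typically use per-channel scales and calibration, which this sketch leaves out.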

Paper Structure

This paper contains 33 sections, 3 figures, 2 tables.

Figures (3)

  • Figure 1: Architecture overview of AutoNeural. The model comprises: (1) a MobileNet-based vision encoder with Multi-Scale Fusion Adapter (MSFA) that processes $768{\times}768$ images into 256 visual tokens, (2) a lightweight two-layer MLP connector without normalization for NPU quantization robustness, and (3) the Liquid AI 1.2B hybrid backbone with 16 layers that interleaves 10 gated-convolution layers with 6 Transformer attention layers to reduce memory I/O.
  • Figure 2: Performance of AutoNeural after mixed-precision quantization (vision encoder: W8A16, language model: W4A16) and deployment on Qualcomm SA8295P NPU. Results reflect actual on-device execution, not PyTorch simulation, demonstrating stable accuracy and substantial latency improvements.
  • Figure 3: Vision encoder latency comparison on Qualcomm SA8295P NPU across three input resolutions. AutoNeural's MobileNet-based encoder achieves a 5.8$\times$ speedup at 256$\times$256 and a 14$\times$ speedup at 512$\times$512, and processes 768$\times$768 images in real time, whereas InternViT-300M exceeds NPU memory capacity at that resolution. Lower latency is better.
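
The layer split in the Figure 1 caption (10 gated-convolution layers and 6 attention layers out of 16) suggests a rough back-of-the-envelope estimate of key-value cache size. The sketch below is an assumption-laden illustration rather than a figure from the paper: only the layer counts come from the caption, while `KV_HEADS`, `HEAD_DIM`, `SEQ_LEN`, and the 2-byte cache precision are hypothetical. It also assumes that gated-convolution layers keep a small fixed-size state, which is ignored here, and it compares against a hypothetical all-attention model of the same depth.

```python
def kv_cache_bytes(attention_layers, seq_len, kv_heads, head_dim, bytes_per_value=2):
    """Bytes needed to cache keys and values for all attention layers at a given length."""
    keys_and_values = 2  # one tensor for keys, one for values
    return keys_and_values * attention_layers * seq_len * kv_heads * head_dim * bytes_per_value

# From the Figure 1 caption.
TOTAL_LAYERS = 16
ATTENTION_LAYERS = 6

# Hypothetical values chosen only to make the arithmetic concrete.
KV_HEADS = 8
HEAD_DIM = 64
SEQ_LEN = 4096

hybrid_bytes = kv_cache_bytes(ATTENTION_LAYERS, SEQ_LEN, KV_HEADS, HEAD_DIM)
dense_bytes = kv_cache_bytes(TOTAL_LAYERS, SEQ_LEN, KV_HEADS, HEAD_DIM)
ratio = hybrid_bytes / dense_bytes

print(f"hybrid (6 attention layers):  {hybrid_bytes / 2**20:.0f} MiB")
print(f"dense (16 attention layers): {dense_bytes / 2**20:.0f} MiB")
print(f"cache size ratio: {ratio:.3f}")
```

Under these assumptions the ratio depends only on the layer counts (6/16), so the hybrid stack caches 37.5% of what the all-attention stack would, whatever head configuration is chosen. Less cache to read per generated token is one plausible reason a hybrid design is less I/O-bound during decoding.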