Tiny but Mighty: A Software-Hardware Co-Design Approach for Efficient Multimodal Inference on Battery-Powered Small Devices

Yilong Li, Shuai Zhang, Yijing Zeng, Hao Zhang, Xinmiao Xiong, Jingyu Liu, Pan Hu, Suman Banerjee

TL;DR

The paper tackles the challenge of deploying large multimodal models on battery-powered edge devices by introducing Nanomind, a hardware–software co-design that decomposes LMMs into modular components and dynamically offloads them to the most suitable accelerators on unified-memory SoCs. It combines custom hardware, module-level scheduling, and low-bit computation kernels, enabling on-device inference with significant energy and memory savings and prolonged operation (e.g., ~20.8 hours on a 2000 mAh battery). Key innovations include a cross-accelerator offloading framework, a token-aware buffer manager for zero-copy transfers, hybrid quantization across modules, and an On-Demand Cascade Inference mode for ultra-low-power scenarios. The results demonstrate substantial practical impact for private, offline multimodal AI on small devices, advancing toward democratized edge intelligence.
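As a rough sanity check on the battery figure, the average draw implied by the quoted runtime can be worked out directly. This is a back-of-the-envelope sketch, not a measurement from the paper: it assumes the full 2000 mAh is usable and, for the power estimate, a nominal 3.7 V single-cell voltage that the summary does not state. The function names are illustrative.

```python
def average_current_ma(capacity_mah: float, runtime_hours: float) -> float:
    """Average current (mA) implied by draining capacity_mah over runtime_hours."""
    return capacity_mah / runtime_hours


def average_power_w(capacity_mah: float, runtime_hours: float,
                    nominal_voltage_v: float = 3.7) -> float:
    """Average power (W); the 3.7 V default is an assumption, not a paper value."""
    return average_current_ma(capacity_mah, runtime_hours) / 1000.0 * nominal_voltage_v


current = average_current_ma(2000, 20.8)
power = average_power_w(2000, 20.8)
print(f"{current:.1f} mA, {power:.2f} W")  # 96.2 mA, 0.36 W
```

Under these assumptions the device would need to average roughly 96 mA over the whole run, which gives a sense of how tight the power budget is.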

Abstract

Large Multimodal Models (LMMs) are inherently modular, consisting of vision and audio encoders, projectors, and large language models. Yet they are almost always executed monolithically, which underutilizes the heterogeneous accelerators (NPUs, GPUs, DSPs) in modern SoCs and leads to high end-to-end latency. In this paper, we present NANOMIND, a hardware–software co-design inference framework for LMMs that breaks large models into modular "bricks" (vision, language, audio, etc.) and maps each to its ideal accelerator. The key insight is that large models can be broken into modular components and scheduled to run on the most appropriate compute units. NANOMIND performs module-level dynamic offloading across accelerators on unified-memory SoCs. By combining customized hardware design, system-level scheduling, and optimized low-bit computation kernels, we demonstrate our framework with a compact, battery-powered device capable of running LMMs entirely on device. This prototype functions as a self-contained intelligent assistant that requires no network connectivity, while achieving higher throughput and superior power efficiency under strict resource constraints. The design further bypasses CPU bottlenecks and reduces redundant memory usage through token-aware buffer management and module-level coordination. Our system outperforms existing implementations in resource efficiency, cutting energy consumption by 42.3% and GPU memory usage by 11.2%. This enables a battery-powered device to run LLaVA-OneVision with a camera for nearly 20.8 hours.
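The idea behind token-aware, zero-copy buffering can be illustrated with a small ring buffer in which a producer (standing in for an encoder) writes each token embedding in place and a consumer (standing in for the language model) reads the same memory through a view. This is a minimal single-threaded sketch and not the paper's implementation: the class `TokenRingBuffer`, its method names, and the use of a Python `memoryview` in place of shared SoC memory are all assumptions made for illustration.

```python
from array import array


class TokenRingBuffer:
    """Fixed-size ring of per-token embedding slots backed by one allocation."""

    def __init__(self, capacity_tokens: int, dim: int) -> None:
        self.capacity = capacity_tokens
        self.dim = dim
        # One contiguous float32 allocation shared by producer and consumer.
        self._storage = array("f", [0.0] * (capacity_tokens * dim))
        self._view = memoryview(self._storage)
        self._head = 0   # next slot to write
        self._tail = 0   # next slot to read
        self._count = 0  # slots currently occupied

    def _slot(self, index: int) -> memoryview:
        start = index * self.dim
        # Slicing a memoryview yields another view; no bytes are copied.
        return self._view[start:start + self.dim]

    def acquire_write_slot(self) -> memoryview:
        if self._count == self.capacity:
            raise BufferError("ring buffer full")
        slot = self._slot(self._head)
        self._head = (self._head + 1) % self.capacity
        self._count += 1
        return slot

    def acquire_read_slot(self) -> memoryview:
        if self._count == 0:
            raise BufferError("ring buffer empty")
        slot = self._slot(self._tail)
        self._tail = (self._tail + 1) % self.capacity
        self._count -= 1
        return slot


buf = TokenRingBuffer(capacity_tokens=4, dim=3)
slot = buf.acquire_write_slot()
slot[:] = array("f", [0.5, 1.0, 1.5])  # "encoder" writes the embedding in place
view = buf.acquire_read_slot()         # "decoder" sees the same memory
print(view.tolist())  # [0.5, 1.0, 1.5]
```

Because both sides hold views onto one allocation, handing an embedding from one module to the next moves only an index, never the data. A real system would additionally need synchronization between the accelerators, which this sketch omits.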

Paper Structure

This paper contains 21 sections, 13 figures, 2 tables.

Figures (13)

  • Figure 1: Workflow of Nanomind: VLM Offloading to NPU/GPU with Zero-Copy Embedding Transfer via Ring Buffer.
  • Figure 2: Workflow of Low-Power On-Demand Cascade Inference. Each modular model follows a "load → execute → release" workflow: once it completes inference, it releases its hardware resources immediately.
  • Figure 3: Architecture of Nanomind: Enable Multimodal Inference via Software-Hardware (SW/HW) Co-design.
  • Figure 4: Nanomind hardware design and PCB layout. (a) Block diagram of hardware components: an RK3566 SoC, a PMU IC for power monitoring, and LPDDR4x memory modules in parallel; (b) front view of PCB design; (c) back view of PCB design.
  • Figure 5: Memory utilization (GB) across different hardware platforms and LLM frameworks for LLaVA-OneVision-0.5B, Qwen2-VL-2B-Instruct, and SmolVLM-500M.
  • ...and 8 more figures
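
The "load → execute → release" workflow described in the Figure 2 caption can be sketched as a chain of stages, each holding its accelerator only for the duration of its own step. This is an illustrative sketch under stated assumptions, not the paper's code: the helper names `loaded` and `cascade`, the stage list, and the particular module-to-accelerator assignments are hypothetical, and the "model" is a placeholder that just wraps its input in a string.

```python
from contextlib import contextmanager
from typing import Callable, Iterator, List, Tuple


@contextmanager
def loaded(name: str, accelerator: str,
           log: List[str]) -> Iterator[Callable[[str], str]]:
    """Load a module, yield a callable for it, and always release it afterwards."""
    log.append(f"load {name} -> {accelerator}")
    try:
        # Placeholder "model": wraps the input so the call order stays visible.
        yield lambda x: f"{name}({x})"
    finally:
        log.append(f"release {name}")


def cascade(stages: List[Tuple[str, str]], request: str, log: List[str]) -> str:
    """Run stages one at a time; only one module is resident at any moment."""
    data = request
    for name, accelerator in stages:
        with loaded(name, accelerator, log) as run:
            log.append(f"execute {name}")
            data = run(data)
    return data


# Hypothetical assignment of modules to accelerators, for illustration only.
STAGES = [("vision_encoder", "NPU"), ("projector", "GPU"), ("llm", "GPU")]
log: List[str] = []
result = cascade(STAGES, "frame", log)
print(result)  # llm(projector(vision_encoder(frame)))
```

Using a context manager guarantees that the release step runs even if a stage raises, which mirrors the caption's point that resources are freed as soon as each module finishes.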