SwapNet: Efficient Swapping for DNN Inference on Edge AI Devices Beyond the Memory Budget

Kun Wang; Jiani Cao; Zimu Zhou; Zhenjiang Li

TL;DR

Edge devices face strict memory budgets that limit on-device DNN deployment. SwapNet is a memory-efficient middleware that partitions DNNs into blocks, swaps them in and out with zero-copy swap-in, and assembles blocks by reference, preserving accuracy and compatibility with existing toolchains. Evaluated on 11 DNN tasks on Jetson devices, it achieves almost the same latency as the case with sufficient memory even when DNNs demand 2.32x to 5.81x memory beyond the available budget, and it integrates with multi-DNN scheduling to handle concurrent workloads. The results point to a practical path toward running large models, potentially including large language models (LLMs), on resource-constrained edge hardware.

Abstract

Executing deep neural networks (DNNs) on edge artificial intelligence (AI) devices enables various autonomous mobile computing applications. However, the memory budget of edge AI devices restricts the number and complexity of DNNs allowed in such applications. Existing solutions, such as model compression or cloud offloading, reduce the memory footprint of DNN inference at the cost of decreased model accuracy or autonomy. To avoid these drawbacks, we divide a DNN into blocks and swap them in and out in order, such that large DNNs can execute within a small memory budget. Nevertheless, naive swapping on edge AI devices induces significant delays due to the redundant memory operations in the DNN development ecosystem for edge AI devices. To this end, we develop SwapNet, an efficient DNN block swapping middleware for edge AI devices. We systematically eliminate the unnecessary memory operations during block swapping while retaining compatibility with the deep learning frameworks, GPU backends, and hardware architectures of edge AI devices. We further showcase the utility of SwapNet via a multi-DNN scheduling scheme. Evaluations on eleven DNN inference tasks in three applications demonstrate that SwapNet achieves almost the same latency as the case with sufficient memory even when DNNs demand 2.32x to 5.81x memory beyond the available budget. The design of SwapNet also provides novel and feasible insights for deploying large language models (LLMs) on edge AI devices in the future.
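
The basic idea of dividing a DNN into blocks and swapping them in and out in order can be illustrated with a toy simulation. The sketch below is not SwapNet's implementation: the `Layer` class, the greedy `partition_into_blocks` helper, the layer sizes, and the 100 MB budget are all hypothetical, and it models only weight memory (not activations or swap latency). It shows how a model whose weights total more than the budget can still run when only one block is resident at a time.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Layer:
    """Hypothetical stand-in for a DNN layer: a weight size and a function."""
    name: str
    size_mb: int
    fn: Callable[[float], float]


def partition_into_blocks(layers: List[Layer], budget_mb: int) -> List[List[Layer]]:
    """Greedily group consecutive layers so each block fits within the budget.

    This greedy rule is an assumption made for illustration only.
    """
    blocks: List[List[Layer]] = []
    current: List[Layer] = []
    used = 0
    for layer in layers:
        if layer.size_mb > budget_mb:
            raise ValueError(f"{layer.name} alone exceeds the budget")
        if current and used + layer.size_mb > budget_mb:
            blocks.append(current)  # close the current block
            current, used = [], 0
        current.append(layer)
        used += layer.size_mb
    if current:
        blocks.append(current)
    return blocks


def run_with_swapping(layers: List[Layer], budget_mb: int, x: float) -> Tuple[float, int]:
    """Execute blocks in order, keeping only one block resident at a time."""
    peak_mb = 0
    for block in partition_into_blocks(layers, budget_mb):
        resident_mb = sum(layer.size_mb for layer in block)  # "swap in" the block
        peak_mb = max(peak_mb, resident_mb)
        for layer in block:
            x = layer.fn(x)
        # "swap out": the block's memory is released before the next one loads
    return x, peak_mb


layers = [
    Layer("conv1", 40, lambda v: v + 1),
    Layer("conv2", 60, lambda v: v * 2),
    Layer("conv3", 50, lambda v: v - 3),
    Layer("fc", 30, lambda v: v * 10),
]
output, peak_mb = run_with_swapping(layers, budget_mb=100, x=1.0)
total_mb = sum(layer.size_mb for layer in layers)
print(f"output={output}, peak={peak_mb} MB, model total={total_mb} MB")
```

In this toy setting the result is identical to running the layers without swapping, since no weights are modified; what the simulation leaves out is the cost of each swap, which is the overhead the abstract attributes to redundant memory operations.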

Paper Structure (32 sections, 1 equation, 19 figures, 3 tables)

Introduction
Preliminaries
Memory Budget of Edge AI devices for DNN Inference
DNN Inference: Edge AI Devices vs. Mobile Devices
SwapNet Overview
SwapNet Block Swapping Controller
Limitations of Standard Block Swapping
Zero-Copy Block Swap-In
Direct Block Fetch
Copy-Free GPU Dispatch
SwapNet Block Swapping Controller in Operation
SwapNet Block Assembly Controller
Limitations of Naive Block Assemble
Block Assemble by Reference
SwapNet Utility: Multi-DNN Scheduling with Efficient Swapping
...and 17 more sections

Figures (19)

Figure 1: Illustration of an edge AI device based autonomous vehicle. It is expected to run multiple DNN and non-DNN tasks at a low memory budget.
Figure 2: Comparison of the DNN development tool chains for mobile devices, edge AI devices, and PC-grade devices.
Figure 3: Overview of SwapNet design.
Figure 4: Workflow of SwapNet block swap-in. Components in red are modifications on top of the standard block swap-in.
Figure 5: Dependence graph $G$ parsed in PyTorch to trace the location of CPU memory allocation.
...and 14 more figures
