VLM in a flash: I/O-Efficient Sparsification of Vision-Language Model via Neuron Chunking

Kichang Yang, Seonjun Kim, Minjae Kim, Nairan Zhang, Chi Zhang, Youngki Lee

TL;DR

This paper addresses the I/O bottlenecks of flash-offloaded Vision-Language Models on edge devices by proposing Neuron Chunking, a latency-aware sparsification technique that accounts for flash contiguity rather than activation magnitude alone. The method introduces a contiguity distribution to abstract access patterns, a chunk-based latency model to estimate I/O cost, and a utility-guided greedy algorithm to select high-value, contiguous neuron chunks while respecting a sparsity budget. Empirical results on Jetson Orin Nano and Jetson AGX Orin show I/O latency reductions of up to 4.65x and 5.76x, respectively, with competitive accuracy across multiple models and benchmarks. The work highlights the importance of hardware-aware pruning for edge inference and outlines generalizations to other model families and future hardware trends.

Abstract

Edge deployment of large Vision-Language Models (VLMs) increasingly relies on flash-based weight offloading, where activation sparsification is used to reduce I/O overhead. However, conventional sparsification remains model-centric, selecting neurons solely by activation magnitude and neglecting how access patterns influence flash performance. We present Neuron Chunking, an I/O-efficient sparsification strategy that operates on chunks (i.e., groups of contiguous neurons in memory) and couples neuron importance with storage access cost. The method models I/O latency through a lightweight abstraction of access contiguity and selects chunks with high utility, defined as neuron importance normalized by estimated latency. By aligning sparsification decisions with the underlying storage behavior, Neuron Chunking improves I/O efficiency by up to 4.65x and 5.76x on Jetson Orin Nano and Jetson AGX Orin, respectively.
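
The selection rule described in the abstract (utility as neuron importance normalized by estimated latency) can be illustrated with a small Python sketch. This is not the authors' implementation: the affine latency model in `estimate_latency`, its constants, the candidate `chunk_sizes`, and the names `Chunk` and `select_chunks` are all assumptions made for illustration. The budget is taken here to be the number of neurons that may be kept.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    start: int   # index of the first neuron in the chunk
    length: int  # number of contiguous neurons in the chunk


def estimate_latency(length, per_request_overhead=1.0, per_neuron_cost=0.1):
    """Hypothetical latency model (an assumption, not the paper's):
    a fixed cost per read request plus a cost proportional to chunk size."""
    return per_request_overhead + per_neuron_cost * length


def select_chunks(importance, budget, chunk_sizes=(1, 2, 4, 8)):
    """Greedily pick non-overlapping contiguous chunks in descending order of
    utility = summed importance / estimated latency, keeping at most
    `budget` neurons in total."""
    n = len(importance)

    # Prefix sums give the summed importance of any window in O(1).
    prefix = [0.0]
    for value in importance:
        prefix.append(prefix[-1] + value)

    # Enumerate every contiguous window of each candidate size.
    candidates = []
    for size in chunk_sizes:
        for start in range(0, n - size + 1):
            chunk_importance = prefix[start + size] - prefix[start]
            utility = chunk_importance / estimate_latency(size)
            candidates.append((utility, Chunk(start, size)))
    candidates.sort(key=lambda item: item[0], reverse=True)

    taken = [False] * n
    selected = []
    used = 0
    for _, chunk in candidates:
        if used + chunk.length > budget:
            continue  # chunk would exceed the neuron budget
        span = range(chunk.start, chunk.start + chunk.length)
        if any(taken[i] for i in span):
            continue  # chunk overlaps one that is already selected
        for i in span:
            taken[i] = True
        selected.append(chunk)
        used += chunk.length
    return sorted(selected, key=lambda c: c.start)


# Toy activations: three strong neighbors, a weak stretch, one strong outlier.
importance = [0.9, 0.8, 0.85, 0.1, 0.1, 0.1, 0.1, 0.95]
print(select_chunks(importance, budget=4))
```

With these toy values, pure magnitude ranking would keep neurons 0, 1, 2, and 7, which requires two separate reads. Under the assumed latency model, the sketch instead keeps the single contiguous chunk covering neurons 0 to 3, trading a small amount of importance for one read. A greedy pass like this is a heuristic and is not guaranteed to find the optimal set of chunks.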

Paper Structure

This paper contains 59 sections, 6 equations, 16 figures, 3 tables, 1 algorithm.

Figures (16)

  • Figure 1: Illustration of conventional sparsification vs. our approach. Existing methods select neurons solely based on activation importance, which often leads to scattered, irregular access patterns with poor I/O efficiency. In contrast, our method explicitly accounts for actual I/O latency, favoring contiguous chunks that achieve better importance–latency trade-offs.
  • Figure 2: Activation-magnitude plot for two workloads: (teal) a ReLU-based LLM in the decode phase and (magenta) a gated-activation-based VLM in the frame appending phase. The VLM exhibits a smoother distribution, with much less variation between high and low activation values.
  • Figure 3: Read throughput as a function of block size and number of requests, profiled on Jetson AGX Orin with a Samsung 990 Pro SSD. Throughput quickly saturates and remains stable once the request count exceeds minimal thresholds.
  • Figure 4: Flash read performance under varying access patterns. Left: Throughput vs. block size when reading 128 MB (the MLP weight size in Qwen2-7B). Right: Latency vs. sparsity across two access modes: scattered (random) and contiguous (sufficiently block-aligned to saturate throughput: 328 KB on AGX, 236 KB on Nano). Error bars show $\pm$1 std; dashed lines indicate saturated throughput and full-load latency. Experiments use Linux direct I/O with a 6-thread thread pool in C++.
  • Figure 5: Comparison between real and estimated flash access latency across models and devices.
  • ...and 11 more figures