VLM in a flash: I/O-Efficient Sparsification of Vision-Language Model via Neuron Chunking

Kichang Yang; Seonjun Kim; Minjae Kim; Nairan Zhang; Chi Zhang; Youngki Lee

TL;DR

This paper addresses the I/O bottlenecks of flash-offloaded Vision-Language Models on edge devices by proposing Neuron Chunking, a latency-aware sparsification technique that accounts for flash contiguity rather than activation magnitude alone. The method introduces a contiguity distribution to abstract access patterns, a chunk-based latency model to estimate I/O cost, and a utility-guided greedy algorithm that selects high-value, contiguous neuron chunks while respecting a sparsity budget. Empirical results on Jetson Orin Nano and Jetson AGX Orin show I/O latency reductions of up to 4.65x and 5.76x, respectively, with competitive accuracy across multiple models and benchmarks. The work highlights the importance of hardware-aware pruning for edge inference and outlines generalizations to other model families and future hardware trends.

Abstract

Edge deployment of large Vision-Language Models (VLMs) increasingly relies on flash-based weight offloading, where activation sparsification is used to reduce I/O overhead. However, conventional sparsification remains model-centric, selecting neurons solely by activation magnitude and neglecting how access patterns influence flash performance. We present Neuron Chunking, an I/O-efficient sparsification strategy that operates on chunks (i.e., groups of contiguous neurons in memory) and couples neuron importance with storage access cost. The method models I/O latency through a lightweight abstraction of access contiguity and selects chunks with high utility, defined as neuron importance normalized by estimated latency. By aligning sparsification decisions with the underlying storage behavior, Neuron Chunking improves I/O efficiency by up to 4.65x and 5.76x on Jetson Orin Nano and Jetson AGX Orin, respectively.
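
The selection rule described above (utility as neuron importance normalized by estimated latency, chosen greedily under a budget) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not the paper's implementation: the function names `chunk_latency` and `select_chunks`, the candidate chunk sizes, and the toy latency model (a fixed per-request overhead plus a linear per-neuron transfer cost, with made-up constants) are all hypothetical. The paper's contiguity distribution and measured latency model are not reproduced.

```python
from typing import List, Tuple


def chunk_latency(n_neurons: int,
                  overhead_us: float = 50.0,
                  per_neuron_us: float = 5.0) -> float:
    """Toy latency model (assumption): fixed request cost plus linear transfer cost."""
    return overhead_us + per_neuron_us * n_neurons


def select_chunks(importance: List[float],
                  budget: int,
                  sizes: Tuple[int, ...] = (1, 2, 4)) -> List[Tuple[int, int]]:
    """Greedily pick non-overlapping (start, length) chunks by utility.

    Utility of a chunk = summed importance of its neurons / estimated latency.
    `budget` is the maximum number of neurons that may be loaded.
    """
    # Enumerate every contiguous candidate chunk for each allowed size.
    candidates = []
    for size in sizes:
        for start in range(0, len(importance) - size + 1):
            gain = sum(importance[start:start + size])
            candidates.append((gain / chunk_latency(size), start, size))

    # Highest utility first.
    candidates.sort(key=lambda c: c[0], reverse=True)

    taken = [False] * len(importance)
    chosen: List[Tuple[int, int]] = []
    remaining = budget
    for _utility, start, size in candidates:
        if size > remaining:
            continue  # chunk would exceed the neuron budget
        if any(taken[start:start + size]):
            continue  # overlaps a chunk that was already selected
        for i in range(start, start + size):
            taken[i] = True
        chosen.append((start, size))
        remaining -= size
        if remaining == 0:
            break
    return sorted(chosen)


if __name__ == "__main__":
    scores = [0.9, 0.8, 0.1, 0.0, 0.0, 0.7, 0.0, 0.0]
    print(select_chunks(scores, budget=3))
```

With these toy inputs the sketch selects neurons 0 and 1 as a single contiguous chunk plus neuron 5, so three neurons are covered by two reads. Under the assumed latency model, the fixed per-request overhead is what makes the two-neuron chunk more attractive than two separate single-neuron reads.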
