Range Asymmetric Numeral Systems-Based Lightweight Intermediate Feature Compression for Split Computing of Deep Neural Networks
Mingyu Sung, Suhwan Im, Vikas Palakonda, Jae-Mo Kang
TL;DR
This work addresses the bandwidth bottleneck in split computing for DNN inference with a lightweight, distribution-agnostic compression framework based on Range Asymmetric Numeral Systems (rANS). The method combines asymmetric integer quantization with a sparse CSR representation and reshapes intermediate feature tensors to skew symbol distributions and reduce entropy; the pipeline runs on GPUs with sub-millisecond encoding/decoding. A theoretical cost model links the reshape dimension $N$ to the entropy $H(p(N))$, guiding a search for a near-optimal $\tilde{N}$ that minimizes total communication and computation cost. Empirical results across vision models (ResNet, VGG, SwinT, etc.) and language models (Llama2 7B/13B) show data-size reductions of up to 7.2× with near-baseline accuracy, including significant transmission-time savings for LLMs, indicating that the approach is applicable to bandwidth-constrained edge-cloud AI deployments.
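To make the interplay between quantization, the CSR layout, and the reshape dimension $N$ concrete, the following is a minimal pure-Python sketch, not the authors' GPU implementation. The helper names (`quantize_asymmetric`, `to_csr`, `entropy_bits`), the synthetic post-ReLU-like input, and the choice to merge the CSR values, column indices, and row pointers into one symbol stream are all assumptions made for illustration. The sketch quantizes a sparse feature vector, reshapes it into rows of length $N$, and prints an entropy-based size estimate for a few candidate values of $N$.

```python
import math
import random
from collections import Counter


def quantize_asymmetric(x, bits=8):
    """Map floats to integers in [0, 2**bits - 1] using a scale and zero-point."""
    lo, hi = min(min(x), 0.0), max(max(x), 0.0)  # range must contain 0.0
    qmax = 2 ** bits - 1
    scale = (hi - lo) / qmax or 1.0              # avoid a zero scale on constant input
    zero_point = round(-lo / scale)              # integer that represents real 0.0
    q = [min(qmax, max(0, round(v / scale) + zero_point)) for v in x]
    return q, scale, zero_point


def to_csr(q, n_cols, zero_point):
    """Reshape the flat list into rows of n_cols and return CSR arrays."""
    values, col_idx, row_ptr = [], [], [0]
    for start in range(0, len(q), n_cols):
        for col, v in enumerate(q[start:start + n_cols]):
            if v != zero_point:                  # keep only entries that are not "zero"
                values.append(v)
                col_idx.append(col)
        row_ptr.append(len(values))
    return values, col_idx, row_ptr


def entropy_bits(symbols):
    """Empirical Shannon entropy in bits per symbol."""
    counts, total = Counter(symbols), len(symbols)
    return -sum(c / total * math.log2(c / total) for c in counts.values())


random.seed(0)
# Synthetic stand-in for post-ReLU features: mostly exact zeros (an assumption).
feat = [max(0.0, random.gauss(-0.8, 1.0)) for _ in range(4096)]
q, scale, zp = quantize_asymmetric(feat)

for n in (16, 64, 256, 1024):                    # hypothetical candidates for N
    values, col_idx, row_ptr = to_csr(q, n, zp)
    stream = values + col_idx + row_ptr          # assumed single symbol stream
    est_bits = len(stream) * entropy_bits(stream)
    print(f"N={n:5d}  symbols={len(stream):5d}  estimated bits={est_bits:9.0f}")
```

The estimate `len(stream) * entropy_bits(stream)` is only a lower-bound proxy for what an entropy coder such as rANS could reach on that stream; it ignores the cost of sending the frequency table and any computation cost, both of which a full cost model would need to include.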
Abstract
Split computing distributes deep neural network inference between resource-constrained edge devices and cloud servers, but transmitting intermediate features creates a significant communication bottleneck. We propose a lightweight compression framework that combines Range Asymmetric Numeral Systems (rANS) encoding with asymmetric integer quantization and a sparse tensor representation to reduce transmission overhead, without requiring complex probability modeling or network modifications. The key contributions are: (1) a distribution-agnostic compression pipeline that exploits inherent tensor sparsity to achieve bandwidth reduction with minimal computational overhead; (2) an approximate theoretical model that optimizes tensor reshaping dimensions to maximize compression efficiency; and (3) a GPU-accelerated implementation with sub-millisecond encoding/decoding latency. Extensive evaluations across diverse neural architectures (ResNet, VGG16, MobileNetV2, SwinT, DenseNet121, EfficientNetB0) show that the framework consistently maintains near-baseline accuracy on the CIFAR100 and ImageNet benchmarks. We also validate the framework on natural language processing tasks using Llama2 7B and 13B on standard benchmarks such as MMLU, HellaSwag, ARC, PIQA, Winogrande, BoolQ, and OpenBookQA, demonstrating applicability beyond computer vision. The method thus addresses a fundamental bottleneck in deploying sophisticated artificial intelligence systems in bandwidth-constrained environments without compromising model performance.
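For readers unfamiliar with rANS, the core encode and decode steps can be sketched in a few lines. This is a toy illustration that assumes arbitrary-precision Python integers for the coder state; practical coders, including GPU ones, use a fixed-width state with renormalization, which is omitted here. The function names and the sample data are hypothetical and do not come from the paper.

```python
from collections import Counter


def build_table(symbols):
    """Return per-symbol frequency, cumulative start, and the total count M."""
    freq = Counter(symbols)
    start, total = {}, 0
    for s in sorted(freq):
        start[s] = total
        total += freq[s]
    return freq, start, total


def rans_encode(symbols, freq, start, total, state=1):
    """Fold symbols into one integer state; symbols are pushed in reverse order."""
    for s in reversed(symbols):
        state = (state // freq[s]) * total + start[s] + (state % freq[s])
    return state


def rans_decode(state, count, freq, start, total):
    """Pop `count` symbols from the state; they come out in original order."""
    out = []
    for _ in range(count):
        slot = state % total                     # position inside [0, M)
        s = next(k for k in freq if start[k] <= slot < start[k] + freq[k])
        out.append(s)
        state = freq[s] * (state // total) + slot - start[s]
    return out, state


# Skewed sample stream: mostly zeros, as in a sparse quantized tensor.
data = [0] * 12 + [3, 0, 7, 0]
freq, start, total = build_table(data)
code = rans_encode(data, freq, start, total)
decoded, final_state = rans_decode(code, len(data), freq, start, total)

assert decoded == data and final_state == 1      # lossless round trip
print(f"{len(data)} symbols packed into {code.bit_length()} bits")
```

Frequent symbols grow the state by a smaller factor than rare ones, which is why a skewed symbol distribution yields a shorter code.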
