Range Asymmetric Numeral Systems-Based Lightweight Intermediate Feature Compression for Split Computing of Deep Neural Networks

Mingyu Sung, Suhwan Im, Vikas Palakonda, Jae-Mo Kang

TL;DR

This work addresses the bandwidth bottleneck in split computing for DNN inference with a lightweight, distribution-agnostic compression framework based on Range Asymmetric Numeral Systems (rANS). The method combines asymmetric integer quantization with a sparse CSR representation and reshapes intermediate feature tensors to skew symbol distributions and lower entropy, all implemented on GPUs for sub-millisecond encoding and decoding. A theoretical cost model links the reshape dimension $N$ to the entropy $H(p(N))$, guiding a near-optimal search for the $\tilde{N}$ that minimizes total communication and computation cost. Empirical results across vision models (ResNet, VGG, SwinT, etc.) and language models (Llama2 7B/13B) show data-size reductions of up to 7.2× with near-baseline accuracy, including significant transmission-time savings for LLMs, demonstrating broad applicability and practical impact for bandwidth-constrained edge-cloud AI deployments.

Abstract

Split computing distributes deep neural network inference between resource-constrained edge devices and cloud servers but faces significant communication bottlenecks when transmitting intermediate features. To this end, we propose a novel lightweight compression framework that leverages Range Asymmetric Numeral Systems (rANS) encoding with asymmetric integer quantization and sparse tensor representation to reduce transmission overhead dramatically. The approach requires neither complex probability modeling nor network modifications. The key contributions include: (1) a distribution-agnostic compression pipeline that exploits inherent tensor sparsity to achieve bandwidth reduction with minimal computational overhead; (2) an approximate theoretical model that optimizes tensor reshaping dimensions to maximize compression efficiency; and (3) a GPU-accelerated implementation with sub-millisecond encoding/decoding latency. Extensive evaluations across diverse neural architectures (ResNet, VGG16, MobileNetV2, SwinT, DenseNet121, EfficientNetB0) demonstrate that the proposed framework consistently maintains near-baseline accuracy on the CIFAR100 and ImageNet benchmarks. We further validate the framework on natural language processing tasks using Llama2 7B and 13B on standard benchmarks such as MMLU, HellaSwag, ARC, PIQA, Winogrande, BoolQ, and OpenBookQA, demonstrating its applicability beyond computer vision. The method thus addresses a fundamental bottleneck in deploying sophisticated artificial intelligence systems in bandwidth-constrained environments without compromising model performance.
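
The front end of the pipeline described above (quantize, reshape, pack sparsely, then measure entropy) can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the authors' GPU implementation: the function names, the min/max calibration of the quantizer, the use of per-row nonzero counts in place of the paper's modified CSR layout, and the synthetic post-ReLU input are all hypothetical choices made here.

```python
import numpy as np

def quantize_asymmetric(x, bits=4):
    """Map floats to integers in [0, 2**bits - 1] with a scale and a zero point."""
    qmax = 2 ** bits - 1
    lo, hi = float(x.min()), float(x.max())
    scale = (hi - lo) / qmax if hi > lo else 1.0   # assumption: min/max calibration
    zero_point = int(round(-lo / scale))
    q = np.clip(np.round(x / scale) + zero_point, 0, qmax).astype(np.int64)
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    """Approximate inverse of quantize_asymmetric."""
    return (q.astype(np.float64) - zero_point) * scale

def csr_stream(q, zero_point, n_cols):
    """Reshape to (-1, n_cols) and flatten a CSR-like layout into one symbol vector."""
    m = q.reshape(-1, n_cols)
    mask = m != zero_point            # entries at the zero point represent 0.0
    values = m[mask]                  # nonzero values, row-major order
    cols = np.nonzero(mask)[1]        # column index of each nonzero
    row_counts = mask.sum(axis=1)     # assumption: per-row counts, not cumulative pointers
    return np.concatenate([values, cols, row_counts])

def entropy_bits(symbols):
    """Empirical Shannon entropy of a symbol vector, in bits per symbol."""
    _, counts = np.unique(symbols, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

# Synthetic stand-in for a post-ReLU intermediate feature of shape (128, 28, 28).
rng = np.random.default_rng(0)
x = np.maximum(rng.normal(size=(128, 28, 28)), 0.0)
q, scale, zp = quantize_asymmetric(x, bits=4)
for n in (128, 56, 16, 7):
    d = csr_stream(q, zp, n)
    print(f"N={n:4d}  symbols={d.size:6d}  entropy={entropy_bits(d):.3f} bits/symbol")
```

Multiplying the symbol count by the entropy gives a rough lower bound on the size an entropy coder such as rANS could reach for each reshape dimension, which is the quantity a search over $N$ would compare.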

Paper Structure

This paper contains 25 sections, 9 equations, 4 figures, 5 tables, 1 algorithm.

Figures (4)

  • Figure 1: (a) Schematic diagram of split computing (SC). Due to limited memory on the edge device, only the initial layers of a DNN run locally, while the cloud server processes subsequent layers. IFs from the edge are compressed, transmitted over a wireless link, and decoded on the cloud side before final-layer inference. The four main latency contributors in SC are: (i) edge encoding, (ii) wireless transfer, (iii) cloud decoding, and (iv) GPU integration. (b) Illustrative example of rANS encoding & decoding. Symbols ('A', 'B') are successively encoded into a single state using rANS; decoding recovers the symbols by reversing these state transitions. Notation such as $s_i$ denotes the internal state after processing symbol $i$. (c) Overview of our proposed rANS-based compression pipeline. An IF tensor $X \in \mathbb{R}^{C \times H \times W}$ is reshaped and quantized to produce integer symbols, which are then packed into a modified CSR format and concatenated into a single vector $\mathbf{D}$. Finally, rANS encodes $\mathbf{D}$ into a compact bitstream that is transmitted to the cloud and decoded prior to completing the final DNN layers.
  • Figure 2: Illustration of how reshaping an IF $X \in \mathbb{R}^{128\times 28\times 28}$ affects the data distribution and entropy, ultimately impacting the compressed size. Each subfigure corresponds to reshaping $X$ into $\mathbb{R}^{784\times 128}$, $\mathbb{R}^{1792\times 56}$, $\mathbb{R}^{6272\times 16}$, and $\mathbb{R}^{14336\times 7}$, respectively. The histograms depict how the frequency distribution of unique values (post-quantization) shifts with different reshape dimensions, while the reported entropies and compressed sizes underscore the correlation between a more skewed distribution and improved compression.
  • Figure 3: Measured $\alpha_{\mathrm{enc}} \cdot T_{\mathrm{enc}}(N)$ (top) and $\alpha_{\mathrm{dec}} \cdot T_{\mathrm{dec}}(N)$ (bottom) in milliseconds, as a function of the reshape dimension $N$. Despite varying $N$ over several orders of magnitude, both operations exhibit nearly constant runtimes, indicating that GPU-based parallel processing keeps overhead low in practice. Error bars denote standard deviations across multiple trials.
  • Figure 4: Results for ResNet34 with SL2 on CIFAR100 at various bit-widths ($Q=2,4,6,8$). Blue curves show compressed data sizes, while orange curves depict $T_{\mathrm{tot}}(N)$.
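
The state transitions illustrated in Figure 1(b) can be reproduced with a toy rANS coder. The sketch below is a minimal version that keeps the whole message in a single arbitrary-precision Python integer; practical coders instead use a fixed-width state with renormalization so that bytes can be streamed out. The frequency table, the initial state of 1, and the function names are assumptions made for illustration and are not taken from the paper.

```python
def build_tables(freqs):
    """Cumulative start of each symbol's slot range; freqs maps symbol -> positive count."""
    starts, total = {}, 0
    for sym in sorted(freqs):
        starts[sym] = total
        total += freqs[sym]
    return starts, total

def rans_encode(symbols, freqs):
    """Fold a symbol sequence into one integer state (no renormalization)."""
    starts, total = build_tables(freqs)
    state = 1                      # assumption: arbitrary nonzero initial state
    trace = [state]
    # rANS is last-in, first-out, so encode in reverse to decode in forward order.
    for sym in reversed(symbols):
        f, c = freqs[sym], starts[sym]
        state = (state // f) * total + c + (state % f)
        trace.append(state)
    return state, trace

def rans_decode(state, n, freqs):
    """Recover n symbols by reversing the encoder's state transitions."""
    starts, total = build_tables(freqs)
    out = []
    for _ in range(n):
        slot = state % total       # the slot identifies the symbol
        sym = next(s for s in starts if starts[s] <= slot < starts[s] + freqs[s])
        state = freqs[sym] * (state // total) + slot - starts[sym]
        out.append(sym)
    return out

freqs = {"A": 3, "B": 1}           # hypothetical counts: 'A' is three times as likely as 'B'
message = list("AABAB")
state, trace = rans_encode(message, freqs)
decoded = rans_decode(state, len(message), freqs)
print(state, trace, "".join(decoded))
```

Frequent symbols grow the state more slowly than rare ones, which is why a more skewed symbol distribution yields a smaller final state and hence a shorter bitstream.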