Why Should the Server Do It All?: A Scalable, Versatile, and Model-Agnostic Framework for Server-Light DNN Inference over Massively Distributed Clients via Training-Free Intermediate Feature Compression

Mingyu Sung, Suhwan Im, Daeho Bang, Il-Min Kim, Sangseok Yun, Jae-Mo Kang

TL;DR

This work tackles server bottlenecks in edge-cloud DNN inference, which stem from fixed split points and from the bulky intermediate data produced at every token in autoregressive workloads. It introduces SLICER, a training-free, architecture-agnostic codec that compresses intermediate features via asymmetric top-K filtering, magnitude-splitting, and adaptive bit quantization, guided by a constraint-aware predictive configuration (SLICER-Search). Across vision and NLP benchmarks, SLICER achieves up to $10\times$ uplink reduction and up to $4.4\times$ server-time savings with minimal accuracy loss ($0$–$3\,\text{pp}$), and it scales to multi-device and AR inference by shifting compute toward the edge. The approach attaches to off-the-shelf models without retraining, offering a plug-and-play path to scalable, low-latency distributed inference in real-world deployments.

Abstract

Modern DNNs often rely on edge-cloud model partitioning (MP), but widely used schemes fix shallow, static split points that underutilize edge compute and concentrate latency and energy on the server. The problem is exacerbated in autoregressive (AR) LLM inference, where per-token forward passes repeatedly generate bulky intermediate features (IFs). We introduce SLICER, a retraining-free, architecture-agnostic framework that compresses IFs to reduce both communication and server load in split computing. SLICER combines (i) asymmetric top-K filtering (ATKF) to sparsify low-magnitude activations, (ii) magnitude-splitting (MS) to group the remaining non-zeros into equal-cardinality blocks, and (iii) adaptive bit quantization (ABQ) that selects per-block bitwidths under a distortion budget. Across standard vision and LLM workloads (e.g., ImageNet/COCO; HellaSwag, PIQA, ARC-E/C, GSM8K, HumanEval), SLICER reduces uplink volume by up to 10x and server GPU time by up to 4.4x, while keeping task quality within ~0-3 pp of baseline. In multi-device settings and AR LLMs, SLICER scales by shifting meaningful compute to the edge and lowering bits-per-token and server time per token, stabilizing per-step traffic. The codec attaches to off-the-shelf models without retraining or architectural changes, offering a plug-and-play path to scalable, low-latency distributed inference. Code is provided in the supplementary material.
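
The three stages named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the reading of "asymmetric" as separate keep-budgets for positive and negative activations, the uniform min-max quantizer, the candidate bitwidths (2, 4, 8), and the use of per-block mean squared error as the distortion budget are all assumptions made for illustration.

```python
from typing import List, Tuple

Entry = Tuple[int, float]  # (index in the flattened IF, activation value)

def atkf(x: List[float], k_pos: int, k_neg: int) -> List[Entry]:
    """Top-K filtering with separate budgets per sign (assumed meaning of 'asymmetric')."""
    pos = sorted(((i, v) for i, v in enumerate(x) if v > 0), key=lambda t: -t[1])
    neg = sorted(((i, v) for i, v in enumerate(x) if v < 0), key=lambda t: t[1])
    return pos[:k_pos] + neg[:k_neg]

def magnitude_split(kept: List[Entry], n_blocks: int) -> List[List[Entry]]:
    """Sort survivors by |value| and cut them into blocks of equal size.

    The last block may be smaller when the count is not divisible by n_blocks.
    """
    if not kept:
        return []
    ordered = sorted(kept, key=lambda t: abs(t[1]))
    size = -(-len(ordered) // n_blocks)  # ceiling division
    return [ordered[i:i + size] for i in range(0, len(ordered), size)]

def quantize(values: List[float], bits: int) -> List[float]:
    """Uniform min-max quantization followed by dequantization (assumed quantizer)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return list(values)
    step = (hi - lo) / ((1 << bits) - 1)
    return [lo + round((v - lo) / step) * step for v in values]

def abq(block: List[Entry], max_mse: float, candidates=(2, 4, 8)):
    """Pick the smallest candidate bitwidth whose MSE fits the budget.

    Falls back to the largest candidate if none satisfies the budget.
    """
    values = [v for _, v in block]
    for bits in candidates:
        deq = quantize(values, bits)
        mse = sum((a - b) ** 2 for a, b in zip(values, deq)) / len(values)
        if mse <= max_mse:
            break
    return bits, deq

def encode(x: List[float], k_pos: int, k_neg: int, n_blocks: int, max_mse: float):
    """Run the three stages and return (bitwidth, indices, dequantized values) per block."""
    out = []
    for block in magnitude_split(atkf(x, k_pos, k_neg), n_blocks):
        bits, deq = abq(block, max_mse)
        out.append((bits, [i for i, _ in block], deq))
    return out

def decode(encoded, length: int) -> List[float]:
    """Scatter the kept values back into a zero-filled feature vector."""
    x = [0.0] * length
    for _, idxs, vals in encoded:
        for i, v in zip(idxs, vals):
            x[i] = v
    return x

if __name__ == "__main__":
    feature = [0.1, -0.05, 3.0, -2.0, 0.02, 1.5, -0.3, 0.0]
    packed = encode(feature, k_pos=2, k_neg=1, n_blocks=2, max_mse=1e-3)
    print([bits for bits, _, _ in packed])
    print(decode(packed, len(feature)))
```

Grouping values of similar magnitude into the same block narrows each block's dynamic range, so a low bitwidth can often meet the distortion budget; this is the intuition the sketch is meant to convey, under the assumptions listed above.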

Paper Structure

This paper contains 29 sections, 19 equations, 3 figures, 4 tables.

Figures (3)

  • Figure 1: (a) Single-shot inference model where multiple front-end devices offload tasks to a shared back-end based on latency constraints. (b) AR inference scenarios illustrating early exits triggered by memory or latency limits. (c) Visualization of sparsity in IFs, a key technique for compression. (d) Performance metrics demonstrating that our approach reduces the back-end server's load as the number of front-end devices increases.
  • Figure 2: Scalability of our framework in a multi-device scenario. Left: Single-shot inference showing the back-end throughput for ResNet-34 on ImageNet as the number of front-end devices increases. Right: Cumulative $dev_{0}$ running time required to complete the same BoolQ workload with Llama2-7B, plotted against the number of front-end devices.
  • Figure 3: (left) Execution time on the front-end device, IF transmission time, and overhead for Llama2-7B (BoolQ), comparing different methods across varying levels of device-side computation. (right) Computation latency of our framework as a function of IF size.