FrankenSplit: Efficient Neural Feature Compression with Shallow Variational Bottleneck Injection for Mobile Edge Computing

Alireza Furutanpey, Philipp Raith, Schahram Dustdar

TL;DR

FrankenSplit addresses the bandwidth and latency challenges of serving large vision models to mobile edge devices by concentrating client-side computation on neural feature compression through a shallow variational bottleneck, enabled by a fixed lightweight encoder and architecture-aware decoders. The method uses Head Distilled Deep Variational Information Bottleneck (IB) with saliency-guided distortion to train a compact encoder that can support multiple server backbones via blueprint decoders, achieving about $60\%$ bitrate reduction and up to $16\times$ faster end-to-end inference than offloading under constrained networks. Key contributions include a practical training objective for Shallow Variational Bottleneck Injection (SVBI), saliency-based regularization, a generalizable encoder-decoder blueprint framework, and an extensive cross-architecture evaluation showing robust rate-distortion gains and encoder reusability. The work demonstrates a scalable path for Mobile Edge Computing (MEC) in which mobile devices can access large foundation models without reloading weights for each task, enabling multi-task, real-time inference with reduced communication overhead.

Abstract

The rise of mobile AI accelerators allows latency-sensitive applications to execute lightweight Deep Neural Networks (DNNs) on the client side. However, critical applications require powerful models that edge devices cannot host, so they must offload requests, and the high-dimensional data then competes for limited bandwidth. This work proposes shifting away from executing the shallow layers of partitioned DNNs on the client. Instead, it advocates concentrating local resources on variational compression optimized for machine interpretability. We introduce a novel framework for resource-conscious compression models and extensively evaluate our method in an environment reflecting the asymmetric resource distribution between edge devices and servers. Our method achieves 60% lower bitrate than a state-of-the-art split computing (SC) method without decreasing accuracy and is up to 16x faster than offloading with existing codec standards.
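
To make the bandwidth argument concrete, here is a minimal sketch of how payload size feeds into end-to-end latency on a constrained link. It is not taken from the paper: the function names, the three-term latency model (client compute, transfer, server compute), and every number used below are illustrative assumptions.

```python
def transfer_seconds(payload_bytes: float, bandwidth_mbps: float) -> float:
    """Time to push a payload over a link, ignoring protocol overhead."""
    return payload_bytes * 8 / (bandwidth_mbps * 1e6)


def end_to_end_seconds(client_s: float, payload_bytes: float,
                       bandwidth_mbps: float, server_s: float) -> float:
    """Assumed latency model: client compute + transfer + server compute."""
    return client_s + transfer_seconds(payload_bytes, bandwidth_mbps) + server_s


# Hypothetical values, chosen only for illustration.
BANDWIDTH_MBPS = 1.0          # constrained uplink
BASELINE_PAYLOAD = 125_000    # bytes sent by some baseline method
REDUCED_PAYLOAD = 0.4 * BASELINE_PAYLOAD  # a 60% lower bitrate

baseline = end_to_end_seconds(0.01, BASELINE_PAYLOAD, BANDWIDTH_MBPS, 0.02)
reduced = end_to_end_seconds(0.01, REDUCED_PAYLOAD, BANDWIDTH_MBPS, 0.02)
print(f"baseline: {baseline:.3f} s, reduced payload: {reduced:.3f} s")
```

Under this simplified model the transfer term scales linearly with payload size, so on a slow link, where transfer dominates, a lower bitrate translates almost directly into lower end-to-end latency. On a fast link the fixed compute terms dominate and the benefit shrinks.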

Paper Structure

This paper contains 45 sections, 11 equations, 20 figures, 7 tables.

Figures (20)

  • Figure 1: Prediction with on/offloading and split runtimes
  • Figure 2: Output dimensionality distribution for ResNet
  • Figure 3: Inference latency for SC and offloading
  • Figure 4: Prediction with Variational Bottleneck Injection
  • Figure 5: Utilizing client resources with (learned) codecs
  • ...and 15 more figures