Environment-Aware Dynamic Pruning for Pipelined Edge Inference

Austin O'Quinn, Conor Snedeker, Siyuan Zhang, Jenna Kline

TL;DR

The paper tackles the challenge of unpredictable, resource-constrained edge inference by introducing environment-aware dynamic pruning for pipelined edge deployments. It combines pruning-aware training with an online pruning controller that uses precomputed latency and accuracy curves to adapt model slices at runtime without retraining. Empirical results on Raspberry Pi 4B and other hardware show up to 1.5x end-to-end speedups and improved SLO attainment while maintaining practical accuracy, demonstrating robust performance under bursty workloads. This approach provides a practical, low-overhead mechanism for runtime load-balancing across heterogeneous edge devices, enabling more scalable and responsive edge inference systems.

Abstract

IoT and edge-based inference systems require unique solutions to overcome resource limitations and unpredictable environments. In this paper, we propose an environment-aware dynamic pruning system that handles the unpredictability of edge inference pipelines. While traditional pruning approaches can reduce model footprint and compute requirements, they are often performed only once, offline, and are not designed to react to transient or post-deployment device conditions. Similarly, existing pipeline placement strategies may incur high overhead if reconfigured at runtime, limiting their responsiveness. Our approach allows slices of a model, already placed on a distributed pipeline, to be ad-hoc pruned as a means of load-balancing. To support this capability, we introduce two key components: (1) novel training strategies that endow models with robustness to post-deployment pruning, and (2) an adaptive algorithm that determines the optimal pruning level for each node based on monitored bottlenecks. In real-world experiments on a Raspberry Pi 4B cluster running camera-trap workloads, our method achieves a 1.5x speedup and a 3x improvement in service-level objective (SLO) attainment, all while maintaining high accuracy.
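
The abstract's second component, choosing a pruning level per node from monitored bottlenecks, can be illustrated with a minimal sketch. It is not the authors' algorithm: it assumes each stage has an offline-measured latency table keyed by pruning ratio, a matching accuracy table, and a runtime `slowdown` factor reported by a monitor. The names `pick_pruning_ratio` and `plan_pipeline`, the greedy rule, and all numbers are hypothetical.

```python
from typing import Dict, List

# Hypothetical offline profile for one pipeline stage:
# pruning ratio -> latency in ms on an unloaded device.
LATENCY_MS: Dict[float, float] = {0.0: 100.0, 0.2: 80.0, 0.4: 65.0, 0.6: 50.0}
# Hypothetical accuracy measured at the same pruning ratios.
ACCURACY: Dict[float, float] = {0.0: 0.90, 0.2: 0.88, 0.4: 0.84, 0.6: 0.70}


def pick_pruning_ratio(
    latency_ms: Dict[float, float],
    accuracy: Dict[float, float],
    slowdown: float,
    budget_ms: float,
    min_accuracy: float,
) -> float:
    """Greedy rule (an assumption, not the paper's method).

    Return the smallest pruning ratio whose predicted latency fits the
    budget while staying above the accuracy floor. If no ratio fits,
    fall back to the most aggressive ratio that keeps the floor.
    """
    allowed = [r for r in sorted(latency_ms) if accuracy[r] >= min_accuracy]
    if not allowed:
        raise ValueError("no pruning ratio satisfies the accuracy floor")
    for ratio in allowed:
        # Predicted latency = offline latency scaled by the observed slowdown.
        if latency_ms[ratio] * slowdown <= budget_ms:
            return ratio
    return allowed[-1]


def plan_pipeline(
    slowdowns: List[float], budget_ms: float, min_accuracy: float
) -> List[float]:
    """Pick a ratio per stage, treating stages independently (a simplification)."""
    return [
        pick_pruning_ratio(LATENCY_MS, ACCURACY, s, budget_ms, min_accuracy)
        for s in slowdowns
    ]


plan = plan_pipeline([1.0, 1.3, 2.0], budget_ms=100.0, min_accuracy=0.8)
print(plan)
```

In this toy setup the unloaded stage stays unpruned, the stage running 1.3x slower is pruned just enough to fit the budget, and the stage running 2x slower is pruned as far as the accuracy floor allows even though it still misses the budget. Because the decision is only a table lookup, no retraining or re-placement is needed at runtime, which matches the low-overhead intent described above.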

Paper Structure

This paper contains 22 sections, 3 equations, 5 figures.

Figures (5)

  • Figure 1: Pipeline inference comparison with three stages across four edge devices. (a) An imbalanced pipeline, where stage 1 is notably slower, increasing latency for subsequent inferences, (b) a balanced pipeline with evenly distributed stage durations yielding improved latency and throughput.
  • Figure 2: System design overview: user inputs, pipeline partition, offline benchmark, and dynamic pruning control loop.
  • Figure 3: Speedup curves for BioCLIP across different hardware platforms at various pruning ratios.
  • Figure 4: Accuracy curves for BioCLIP tested on DSAIL Camera Trap data [mugambi2022dsail] using two different sets of hyperparameters.
  • Figure 5: BioCLIP pipeline with varying pruning levels under a range of arrival rates.
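
The effect described in the Figure 1 caption can be reproduced with a small simulation. This is a sketch under simplifying assumptions that are not taken from the paper: stage times are deterministic, queues between stages are unbounded, and all inputs are available at time zero. The function name `completion_times` and the stage durations are hypothetical.

```python
from typing import List


def completion_times(stage_ms: List[float], n_items: int) -> List[float]:
    """Finish time (ms) of each item passing through a linear pipeline."""
    free_at = [0.0] * len(stage_ms)  # time at which each stage is next idle
    finished = []
    for _ in range(n_items):
        t = 0.0  # time at which this item leaves the previous stage
        for s, cost in enumerate(stage_ms):
            start = max(t, free_at[s])  # wait for the item and for the stage
            t = start + cost
            free_at[s] = t
        finished.append(t)
    return finished


# Both pipelines do 500 ms of total work per inference.
imbalanced = completion_times([300.0, 100.0, 100.0], n_items=3)
balanced = completion_times([170.0, 170.0, 160.0], n_items=3)
print(imbalanced)  # [500.0, 800.0, 1100.0]
print(balanced)    # [500.0, 670.0, 840.0]
```

The first item takes 500 ms in both cases, but later items are spaced by the slowest stage: 300 ms apart in the imbalanced pipeline versus 170 ms in the balanced one. Under these assumptions, shortening only the bottleneck stage, for example by pruning its slice, improves throughput for the whole pipeline.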