Early-Exit meets Model-Distributed Inference at Edge Networks
Marco Colocrese, Erdem Koyuncu, Hulya Seferoglu
TL;DR
This work addresses the high communication cost of data-distributed inference by proposing MDI-Exit, a decentralized framework that combines model-distributed inference with multiple exit points and adaptive data admission. Workers use queue-length information and confidence-based early exits (confidences $C_k(d)$ compared against thresholds $T_e^k$) to decide when to exit, where to offload, and how much data to admit, following a scheme inspired by Network Utility Maximization. The contributions are per-task policies for inference, early exit, and offloading, plus two data-admission strategies guided by queue states: arrival-rate adaptation and exit-threshold adaptation. The approach is validated on real edge hardware with DNNs such as MobileNetV2 and ResNet-50, where it achieves higher data throughput at a fixed accuracy and higher accuracy at a fixed data rate, aided by autoencoder-based compression of feature vectors. Overall, MDI-Exit targets latency-aware edge inference across heterogeneous devices, which matters for low-latency applications and bandwidth-constrained networks.
Abstract
Distributed inference techniques can be broadly classified into data-distributed and model-distributed schemes. In data-distributed inference (DDI), each worker carries the entire deep neural network (DNN) model but processes only a subset of the data. However, feeding the data to workers results in high communication costs, especially when the data is large. An emerging paradigm is model-distributed inference (MDI), where each worker carries only a subset of the DNN layers. In MDI, a source device that has data processes the first few layers of the DNN and sends the output to a neighboring device, i.e., it offloads the remaining layers. This process ends when all layers have been processed in a distributed manner. In this paper, we investigate the design and development of MDI with early-exit, which rests on the observation that some data does not need to pass through all the layers of a model to reach the desired accuracy, i.e., we can exit the model early once the target accuracy is reached. We design a framework, MDI-Exit, that adaptively determines early-exit and offloading policies as well as data admission at the source. Experimental results on a real-life testbed of NVIDIA Nano edge devices show that MDI-Exit processes more data when accuracy is fixed and achieves higher accuracy for a fixed data rate.
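To make the interplay between model distribution and early exit concrete, the following is a minimal sketch rather than the MDI-Exit implementation. It assumes that each worker holds one contiguous slice of layers followed by an exit head, that the confidence at an exit is the largest softmax probability, and that data exits as soon as this confidence reaches the threshold for that exit. The names `mdi_early_exit`, `STAGES`, `double_layers`, and `identity_head` are hypothetical, and queue-based offloading and data admission are not modeled.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = List[float]
# One stage = (layer slice held by a worker, exit head attached after it).
Stage = Tuple[Callable[[Vector], Vector], Callable[[Vector], Vector]]


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def confidence(logits: Sequence[float]) -> Tuple[int, float]:
    """Return (predicted label, max softmax probability).

    Assumption: the confidence is the top-1 softmax probability.
    """
    probs = softmax(logits)
    label = max(range(len(probs)), key=probs.__getitem__)
    return label, probs[label]


def mdi_early_exit(
    x: Vector, stages: Sequence[Stage], thresholds: Sequence[float]
) -> Tuple[int, int, float]:
    """Run x through the stages in order and stop at the first confident exit.

    Returns (exit index, predicted label, confidence). The last stage always
    produces an output, whatever its confidence.
    """
    features = x
    last = len(stages) - 1
    for k, ((layers, head), threshold) in enumerate(zip(stages, thresholds)):
        features = layers(features)            # worker k processes its layers
        label, conf = confidence(head(features))
        if conf >= threshold or k == last:
            return k, label, conf              # exit here
        # Otherwise the feature vector would be offloaded to the next worker.
    raise ValueError("stages must be non-empty")


# Toy three-worker model (hypothetical): each slice doubles the features and
# each exit head treats the features directly as class logits.
def double_layers(f: Vector) -> Vector:
    return [2.0 * v for v in f]


def identity_head(f: Vector) -> Vector:
    return list(f)


STAGES: List[Stage] = [(double_layers, identity_head)] * 3

if __name__ == "__main__":
    print(mdi_early_exit([0.0, 3.0], STAGES, [0.9, 0.9, 0.9]))  # easy input
    print(mdi_early_exit([0.1, 0.2], STAGES, [0.9, 0.9, 0.9]))  # hard input
```

In this toy setup, an input whose logits separate clearly exits at the first worker, while an ambiguous input travels through all three. Lowering the thresholds lets more data exit earlier, which is the lever that an exit-threshold adaptation policy would adjust.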
