To Offload or Not To Offload: Model-driven Comparison of Edge-native and On-device Processing In the Era of Accelerators
Nathan Ng, David Irwin, Ananthram Swami, Don Towsley, Prashant Shenoy
TL;DR
The paper addresses when a client device equipped with a hardware accelerator should process computations locally and when it should offload them to an edge server. It develops a two-level, queuing-theoretic framework that takes estimated service times as input and derives end-to-end latency bounds for both on-device and edge processing, including device-edge collaboration and multi-tenant edge scenarios. The authors validate the models on DNN, RNN, and LLM workloads running on representative hardware, report a mean absolute percentage error of 2.2%, and show that the crossover predictions are accurate enough to support a model-driven adaptive resource manager. The resulting resource manager makes dynamic offloading decisions to minimize latency under realistic network conditions and multi-tenant edge load, with practical implications for latency-sensitive applications and edge orchestration.
Abstract
Computational offloading is a promising approach for overcoming resource constraints on client devices by moving some or all of an application's computations to remote servers. With the advent of specialized hardware accelerators, client devices can now perform fast local processing of specific tasks, such as machine learning inference, reducing the need for offloading computations. However, edge servers with accelerators also offer faster processing for offloaded tasks than was previously possible. In this paper, we present an analytic and experimental comparison of on-device processing and edge offloading for a range of accelerator, network, multi-tenant, and application workload scenarios, with the goal of understanding when to use local on-device processing and when to offload computations. We present models that leverage analytical queuing results to derive explainable closed-form equations for the expected end-to-end latencies of both strategies, which yield precise, quantitative performance crossover predictions that guide adaptive offloading. We experimentally validate our models across a range of scenarios and show that they achieve a mean absolute percentage error of 2.2% compared to observed latencies. We further use our models to develop a resource manager for adaptive offloading and show its effectiveness under variable network conditions and dynamic multi-tenant edge settings.
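To make the idea of a closed-form latency comparison and a performance crossover concrete, here is a minimal Python sketch. It is not the paper's model. It assumes, purely for illustration, that the device and the edge server each behave as an M/M/1 queue, that the network adds a fixed round-trip time plus a transfer time proportional to payload size, and that other edge tenants contribute an aggregate background arrival rate. All function and parameter names are hypothetical.

```python
import math


def mm1_response_time(arrival_rate, service_time):
    """Expected time in system (queueing + service) for an M/M/1 queue.

    Returns infinity when the queue is unstable (utilization >= 1).
    """
    service_rate = 1.0 / service_time
    if arrival_rate >= service_rate:
        return math.inf
    return 1.0 / (service_rate - arrival_rate)


def on_device_latency(arrival_rate, device_service_time):
    """Assumed on-device latency: a single M/M/1 queue at the local accelerator."""
    return mm1_response_time(arrival_rate, device_service_time)


def edge_latency(arrival_rate, edge_service_time, payload_bits,
                 bandwidth_bps, rtt, background_rate=0.0):
    """Assumed offloading latency: fixed RTT + transfer time + edge M/M/1 queue.

    background_rate stands in for requests from other tenants sharing the edge.
    """
    transfer_time = payload_bits / bandwidth_bps
    edge_time = mm1_response_time(arrival_rate + background_rate,
                                  edge_service_time)
    return rtt + transfer_time + edge_time


def should_offload(arrival_rate, device_service_time, edge_service_time,
                   payload_bits, bandwidth_bps, rtt, background_rate=0.0):
    """Offload only when the modeled edge latency is strictly lower."""
    local = on_device_latency(arrival_rate, device_service_time)
    remote = edge_latency(arrival_rate, edge_service_time, payload_bits,
                          bandwidth_bps, rtt, background_rate)
    return remote < local


def crossover_bandwidth(arrival_rate, device_service_time, edge_service_time,
                        payload_bits, rtt, background_rate=0.0):
    """Bandwidth at which both modeled latencies are equal.

    Returns None when offloading cannot win at any bandwidth under this model.
    """
    local = mm1_response_time(arrival_rate, device_service_time)
    edge_time = mm1_response_time(arrival_rate + background_rate,
                                  edge_service_time)
    if math.isinf(edge_time):
        return None
    margin = local - rtt - edge_time  # time budget left for the transfer
    if margin <= 0:
        return None
    return payload_bits / margin


if __name__ == "__main__":
    # Illustrative numbers only; they are not measurements from the paper.
    params = dict(arrival_rate=10.0, device_service_time=0.05,
                  edge_service_time=0.01, payload_bits=1e6, rtt=0.01)
    print("offload at 20 Mbps:", should_offload(bandwidth_bps=20e6, **params))
    print("offload at 5 Mbps:", should_offload(bandwidth_bps=5e6, **params))
    print("crossover bandwidth (bps):", crossover_bandwidth(**params))
```

Under these assumptions, an adaptive manager would re-evaluate should_offload whenever the measured bandwidth or the edge background load changes, and switch strategies when the comparison flips.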
