To Offload or Not To Offload: Model-driven Comparison of Edge-native and On-device Processing In the Era of Accelerators

Nathan Ng, David Irwin, Ananthram Swami, Don Towsley, Prashant Shenoy

TL;DR

The paper asks when an accelerator-equipped client device should process computations locally and when it should offload them to edge servers. It develops a two-level, queuing-theoretic framework that takes estimated service times as input and derives closed-form expressions for the expected end-to-end latency of both on-device and edge processing, covering device-edge collaboration and multi-tenant edge scenarios. The authors validate the models across diverse DNN, RNN, and LLM workloads on representative hardware, achieving a mean absolute percentage error of $2.2\%$ and predicting performance crossover points accurately enough to drive a model-driven adaptive resource manager. The resulting resource manager makes dynamic offloading decisions that reduce latency under variable network conditions and in multi-tenant edge environments.

Abstract

Computational offloading is a promising approach for overcoming resource constraints on client devices by moving some or all of an application's computations to remote servers. With the advent of specialized hardware accelerators, client devices can now perform fast local processing of specific tasks, such as machine learning inference, reducing the need for offloading computations. However, edge servers with accelerators also offer faster processing for offloaded tasks than was previously possible. In this paper, we present an analytic and experimental comparison of on-device processing and edge offloading for a range of accelerator, network, multi-tenant, and application workload scenarios, with the goal of understanding when to use local on-device processing and when to offload computations. We present models that leverage analytical queuing results to derive explainable closed-form equations for the expected end-to-end latencies of both strategies, which yield precise, quantitative performance crossover predictions that guide adaptive offloading. We experimentally validate our models across a range of scenarios and show that they achieve a mean absolute percentage error of 2.2% compared to observed latencies. We further use our models to develop a resource manager for adaptive offloading and show its effectiveness under variable network conditions and dynamic multi-tenant edge settings.
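To make the idea of a model-driven crossover concrete, the sketch below compares the two strategies under a deliberately simplified assumption: every stage (device accelerator, network uplink, edge accelerator) is treated as an independent M/M/1 queue. This is an illustrative assumption, not the paper's actual model or equations, and all function and parameter names (`mm1_response_time`, `edge_latency`, `tenants`, and so on) are hypothetical.

```python
import math


def mm1_response_time(arrival_rate: float, service_time: float) -> float:
    """Expected time in an M/M/1 system (queueing + service), in seconds.

    Assumption: Poisson arrivals and exponential service times.
    """
    utilization = arrival_rate * service_time
    if utilization >= 1.0:
        return math.inf  # the queue is unstable; latency grows without bound
    return service_time / (1.0 - utilization)


def on_device_latency(arrival_rate: float, device_service_time: float) -> float:
    """On-device processing: a single queue at the local accelerator."""
    return mm1_response_time(arrival_rate, device_service_time)


def edge_latency(
    arrival_rate: float,
    edge_service_time: float,
    payload_bits: float,
    bandwidth_bps: float,
    tenants: int = 1,
) -> float:
    """Edge offloading: network transfer followed by edge processing.

    Assumptions (hypothetical): the uplink behaves as an M/M/1 queue whose
    service time is payload size over bandwidth, the result download is
    negligible, and each co-located tenant sends the same request rate to
    the shared edge accelerator.
    """
    network_service_time = payload_bits / bandwidth_bps
    uplink = mm1_response_time(arrival_rate, network_service_time)
    compute = mm1_response_time(arrival_rate * tenants, edge_service_time)
    return uplink + compute


def choose_strategy(
    arrival_rate: float,
    device_service_time: float,
    edge_service_time: float,
    payload_bits: float,
    bandwidth_bps: float,
    tenants: int = 1,
) -> str:
    """Return 'edge' if the modeled edge latency is lower, else 'device'."""
    local = on_device_latency(arrival_rate, device_service_time)
    remote = edge_latency(
        arrival_rate, edge_service_time, payload_bits, bandwidth_bps, tenants
    )
    return "edge" if remote < local else "device"
```

Under these assumptions, a slower device accelerator can still win once the network transfer dominates: lowering `bandwidth_bps` or raising `tenants` pushes the decision from `"edge"` to `"device"`, which is the kind of crossover an adaptive resource manager would track.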

Paper Structure

This paper contains 27 sections, 2 theorems, 5 equations, 7 figures, 2 tables, 1 algorithm.

Key Result

Lemma 3.1

For accelerator-driven workloads, edge offloading incurs a higher average end-to-end latency than on-device processing when

Figures (7)

  • Figure 1: Modeling request execution using (a) edge offloading and (b) on-device processing.
  • Figure 2: Latency comparison for DNN workloads: (a-b) MobileNetV2, (c-d) InceptionV4, and (e-f) YOLOv8n.
  • Figure 3: Latency comparison of execution strategies for (a) LSTM and (b) Llama-3.2-1B models.
  • Figure 4: Latency comparison under varying network bandwidth: (a) RTX4070 as the edge server, (b) A2 as the edge server.
  • Figure 5: Latency comparisons of execution strategies under different: (a) collaborative processing configurations, (b) request rates, and (c) numbers of co-located applications.
  • ...and 2 more figures

Theorems & Definitions (4)

  • Lemma 3.1
  • Remark 3.1
  • Remark 3.2
  • Lemma 3.2