Accuracy-Delay Trade-Off in LLM Offloading via Token-Level Uncertainty

Yumin Kim, Hyeonsu Lyu, Minjae Lee, Hyun Jong Yang

TL;DR

This work tackles the accuracy–delay trade-off for LLM inference in multi-user mobile edge computing (MEC) by using token-level uncertainty as the offloading decision criterion. It defines a margin-based uncertainty metric $\alpha_i$ and proposes GOA, a greedy offloading algorithm that prioritizes offloading high-uncertainty tasks to edge servers while accounting for wireless and compute constraints through a total delay model $d_{i,j}$. The problem is formulated as a resource-aware optimization and shown to be NP-hard; GOA approximates it with $O(N^3 M^2)$ complexity and delivers favorable accuracy–delay trade-offs across varying user densities. Empirical results on LLaMA/LLaMA-like models with the bAbI dataset show that GOA outperforms baselines in both accuracy and latency with practical runtimes, highlighting its potential for scalable MEC-enabled LLM services.

Abstract

Large language models (LLMs) offer significant potential for intelligent mobile services but are computationally intensive for resource-constrained devices. Mobile edge computing (MEC) allows such devices to offload inference tasks to edge servers (ESs), yet introduces latency due to communication and server-side queuing, especially in multi-user environments. In this work, we propose an uncertainty-aware offloading framework that dynamically decides whether to perform inference locally or offload it to the ES, based on token-level uncertainty and resource constraints. We define a margin-based token-level uncertainty metric and demonstrate its correlation with model accuracy. Leveraging this metric, we design a greedy offloading algorithm (GOA) that minimizes delay while maintaining accuracy by prioritizing offloading for high-uncertainty queries. Our experiments show that GOA consistently achieves a favorable trade-off, outperforming baseline strategies in both accuracy and latency across varying user densities, and operates with practical computation time. These results establish GOA as a scalable and effective solution for LLM inference in MEC environments.
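
The abstract does not spell out the metric or the algorithm, so the following is only a minimal sketch of the general idea under stated assumptions, not the authors' GOA. It assumes the margin is the gap between the two largest next-token probabilities, that a query's uncertainty is one minus the mean margin over its tokens, and that a single hypothetical `capacity` value stands in for the wireless and compute constraints that the paper handles through its delay model. The function names `margin_uncertainty` and `greedy_offload` are illustrative.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over one token's logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def margin_uncertainty(token_logits: Sequence[Sequence[float]]) -> float:
    """Assumed metric: 1 - mean(top-1 prob - top-2 prob) over tokens.

    Values near 0 mean the model is confident; values near 1 mean the
    two best candidates are almost tied at every token.
    """
    margins = []
    for logits in token_logits:
        probs = sorted(softmax(logits), reverse=True)
        margins.append(probs[0] - probs[1])
    return 1.0 - sum(margins) / len(margins)


def greedy_offload(uncertainties: Sequence[float], tau: float, capacity: int) -> List[int]:
    """Pick queries to offload, most uncertain first.

    Only queries whose uncertainty exceeds the threshold tau are
    candidates, and at most `capacity` of them are offloaded
    (a hypothetical stand-in for the real resource constraints).
    """
    order = sorted(range(len(uncertainties)), key=lambda i: uncertainties[i], reverse=True)
    chosen = [i for i in order if uncertainties[i] > tau][:capacity]
    return sorted(chosen)


if __name__ == "__main__":
    # Four users with made-up uncertainty scores; tau = 0.6 as in Figure 3.
    print(greedy_offload([0.9, 0.2, 0.7, 0.65], tau=0.6, capacity=2))
```

In this toy setting, users whose scores fall below the threshold always run locally, and the capacity limit decides which of the remaining users reach the edge server. The actual algorithm additionally weighs the per-user delay $d_{i,j}$, which this sketch deliberately omits.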

Paper Structure (14 sections, 12 equations, 5 figures, 2 tables, 2 algorithms)

Figures (5)

  • Figure 1: System model of LLM inference at the local device and the edge. On the left, all tasks are offloaded to an ES, resulting in high delay due to limited resources. On the right, our proposed system model adaptively determines the offloading decision based on both uncertainty and resource constraints.
  • Figure 2: Analysis of uncertainty.
  • Figure 3: Comparison of offloading strategies under $\tau=0.6$: (a) accuracy and (b) delay according to the number of users $N$.
  • Figure 4: Comparison of GOA performance under different uncertainty thresholds $\tau$: (a) accuracy and (b) delay.
  • Figure 5: Comparison of GOA performance under different uncertainty metrics: (a) accuracy and (b) delay.