Intelligent Orchestration of Distributed Large Foundation Model Inference at the Edge

Fernando Koch, Aladin Djuhera, Alecio Binotto

TL;DR

The paper addresses Large Foundation Model (LFM) inference at the edge under time-varying network and compute conditions by introducing an adaptive split inference orchestration framework. The framework extends traditional orchestrators with capacity-aware workload distribution, dynamic partition migration, and real-time reconfiguration, so that model placement and layer partitioning are decided jointly at runtime, and it protects privacy by keeping sensitive computations local. The paper defines a formal system model and an optimization objective that balances latency, resource utilization, and privacy, and it proposes a reference architecture with Monitoring&CP, Adaptive Orchestrator, Split Revision, and Reconfiguration Broadcast modules. Empirical projections for 5G/6G MEC scenarios indicate lower latency, higher throughput, better resource utilization, higher SLA compliance, and stronger privacy guarantees, supporting practical AIaaS deployments at the edge.

Abstract

Large Foundation Models (LFMs), including multi-modal and generative models, promise to unlock new capabilities for next-generation Edge AI applications. However, performing inference with LFMs in resource-constrained and heterogeneous edge environments, such as Multi-access Edge Computing (MEC), presents significant challenges for workload orchestration due to time-varying network, compute, and storage conditions. In particular, current split inference strategies, which partition LFM layers across nodes, are not designed to adapt to fluctuating workloads, dynamic bandwidth conditions, or evolving privacy constraints in high-utilization MEC environments. In this work, we propose a novel adaptive split inference orchestration framework that elevates both the placement and partitioning of LFM layers to runtime-tunable variables. Specifically, our framework enables real-time, quality-of-service (QoS)-aware management of inference workloads by extending conventional orchestrators with three key services: (1) Capacity-aware workload distribution, which continuously profiles node resources and selects an optimal subset of MEC nodes; (2) Dynamic partition migration, which transparently relocates pre-cut LFM segments in response to changes in utilization or network conditions; (3) Real-time reconfiguration, which dynamically re-splits LFM layers to balance latency, throughput, and privacy. We formalize the joint placement-partitioning problem, outline a reference architecture and algorithmic workflow, and discuss applicability in representative smart city, V2X, and industrial edge scenarios.
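To make the joint placement-partitioning idea concrete, the following is a minimal sketch, not the paper's formulation. It assumes a purely sequential model cut at a single point, a two-node pipeline, a simple additive latency model (compute on the first node, activation transfer, compute on the second node), and a privacy rule that the first `private_layers` layers must run on a trusted node. The names `Node`, `split_latency`, and `best_plan` and all numeric parameters are hypothetical.

```python
from dataclasses import dataclass
from itertools import permutations


@dataclass(frozen=True)
class Node:
    name: str
    flops_per_s: float  # compute currently available on the node (assumed known)
    trusted: bool       # whether privacy-sensitive layers may run here


def split_latency(layer_flops, act_bytes, split, first, second, bandwidth_bps):
    """Estimated latency with layers [0, split) on `first` and [split, n) on `second`."""
    head = sum(layer_flops[:split]) / first.flops_per_s
    tail = sum(layer_flops[split:]) / second.flops_per_s
    # The output activation of layer split-1 is sent over the link.
    transfer = act_bytes[split - 1] * 8 / bandwidth_bps
    return head + transfer + tail


def best_plan(nodes, layer_flops, act_bytes, bandwidth_bps, private_layers):
    """Exhaustively search (node pair, split point) for the lowest estimated latency.

    Returns (latency_s, split, first_node_name, second_node_name), or None
    if no plan satisfies the privacy rule.
    """
    n = len(layer_flops)
    best = None
    for first, second in permutations(nodes, 2):
        if not first.trusted:
            continue  # privacy-sensitive head must stay on a trusted node
        for split in range(max(1, private_layers), n):
            cost = split_latency(layer_flops, act_bytes, split,
                                 first, second, bandwidth_bps)
            if best is None or cost < best[0]:
                best = (cost, split, first.name, second.name)
    return best
```

Exhaustive search is only practical for small node sets and a single cut; it is used here because it makes the trade-off between compute, transfer size, and the privacy constraint easy to inspect.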

Paper Structure

This paper contains 17 sections, 4 equations, 3 figures, 5 tables.

Figures (3)

  • Figure 1: Reference architecture of the proposed adaptive split inference orchestration. Sub-split models (S1, S2, S3) are deployed across edge/cloud nodes, while a central orchestrator, guided by real-time capacity profiling, re-splits and reconfigures workloads on demand to meet QoS and privacy constraints. A corresponding workflow diagram of our proposed Algorithm \ref{alg:workflow} is given in Figure 2.
  • Figure 2: Control-flow diagram of the adaptive split orchestration loop described in Algorithm \ref{alg:workflow}. The orchestrator periodically monitors environment metrics and triggers reconfiguration decisions when QoS thresholds or privacy constraints are violated. Feasible placements are evaluated, and, if no cool-down limit is active, a new mapping is broadcast to all nodes.
  • Figure 3: CDF of end-to-end inference latency for static (solid) vs. adaptive (dashed) split inference in a 5G-MEC scenario. 95% of adaptive requests finish within 300 ms, while static requests may take up to 1 s \cite{zhang2025amp4ecadaptivemodelpartitioning, tuli2022splitplaceaiaugmentedsplitting, EdgeShard, mudvari2024adaptivecompressionawaresplitlearning}.
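
The control loop described in the Figure 2 caption can be sketched roughly as follows. This is an illustrative reading of the caption, not the paper's algorithm: the `Orchestrator` class, the metric keys `latency_ms` and `privacy_violation`, the candidate list, and the `estimate_ms` callback are all hypothetical, and "feasible" is assumed to mean "estimated latency within the SLO".

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Mapping = Dict[str, str]  # model segment name -> node name


@dataclass
class Orchestrator:
    latency_slo_ms: float
    cooldown_s: float
    mapping: Mapping
    last_reconfig: float = float("-inf")
    broadcasts: List[Mapping] = field(default_factory=list)

    def step(self, now: float, metrics: dict,
             candidates: List[Mapping],
             estimate_ms: Callable[[Mapping], float]) -> bool:
        """Run one monitoring iteration; return True if a new mapping was broadcast."""
        # 1. Trigger only when a QoS threshold or privacy constraint is violated.
        violated = (metrics["latency_ms"] > self.latency_slo_ms
                    or bool(metrics.get("privacy_violation", False)))
        if not violated:
            return False
        # 2. Evaluate candidate placements and keep the feasible ones.
        feasible = [c for c in candidates if estimate_ms(c) <= self.latency_slo_ms]
        if not feasible:
            return False
        # 3. Respect the cool-down so the system does not thrash between mappings.
        if now - self.last_reconfig < self.cooldown_s:
            return False
        # 4. Adopt the best feasible mapping and record the broadcast.
        self.mapping = min(feasible, key=estimate_ms)
        self.last_reconfig = now
        self.broadcasts.append(self.mapping)  # stand-in for a real broadcast
        return True
```

In a real deployment the broadcast would be a message to every participating node; here it is modelled as an append to a list so the loop can be exercised in isolation.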