Intelligent Orchestration of Distributed Large Foundation Model Inference at the Edge
Fernando Koch, Aladin Djuhera, Alecio Binotto
TL;DR
The paper tackles the challenge of running Large Foundation Model (LFM) inference at the edge under time-varying network and compute conditions by introducing an adaptive split inference orchestration framework. It extends traditional orchestrators with capacity-aware workload distribution, dynamic partition migration, and real-time reconfiguration, so that model placement and layer partitioning can be decided jointly at runtime, and it protects privacy by keeping sensitive computations local. A formal system model and optimization objective balance latency, resource utilization, and privacy, and the paper proposes a reference architecture with four modules: Monitoring & CP, Adaptive Orchestrator, Split Revision, and Reconfiguration Broadcast. Empirical projections indicate substantial QoS and privacy benefits in 5G/6G MEC scenarios, including lower latency, higher throughput, better resource utilization, higher SLA compliance, and stronger privacy guarantees, supporting practical AIaaS deployments at the edge.
Abstract
Large Foundation Models (LFMs), including multi-modal and generative models, promise to unlock new capabilities for next-generation Edge AI applications. However, performing inference with LFMs in resource-constrained and heterogeneous edge environments, such as Multi-access Edge Computing (MEC), presents significant challenges for workload orchestration due to time-varying network, compute, and storage conditions. In particular, current split inference strategies, which partition LFM layers across nodes, are not designed to adapt to fluctuating workloads, dynamic bandwidth conditions, or evolving privacy constraints in high-utilization MEC environments. In this work, we propose a novel adaptive split inference orchestration framework that elevates both the placement and partitioning of LFM layers to runtime-tunable variables. Specifically, our framework enables real-time, quality-of-service (QoS)-aware management of inference workloads by extending conventional orchestrators with three key services: (1) capacity-aware workload distribution, which continuously profiles node resources and selects an optimal subset of MEC nodes; (2) dynamic partition migration, which transparently relocates pre-cut LFM segments in response to changes in utilization or network conditions; and (3) real-time reconfiguration, which dynamically re-splits LFM layers to balance latency, throughput, and privacy. We formalize the joint placement-partitioning problem, outline a reference architecture and algorithmic workflow, and discuss applicability in representative smart city, V2X, and industrial edge scenarios.
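To make the idea of a runtime-tunable split concrete, the following minimal Python sketch picks a split point between one local device and one MEC node by exhaustive search over a simple additive latency model. It is an illustration under stated assumptions, not the paper's formulation or algorithm: the names `Node` and `best_split`, the latency model (local compute plus activation transfer plus edge compute), the `min_local_layers` privacy constraint, and all numeric values are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Node:
    """A compute node with its currently available throughput (hypothetical units)."""
    name: str
    flops_per_s: float


def best_split(
    layer_flops: List[float],
    activation_bytes: List[float],
    local: Node,
    edge: Node,
    bandwidth_bytes_per_s: float,
    min_local_layers: int = 0,
) -> Tuple[int, float]:
    """Return (k, latency): layers [0, k) run on `local`, layers [k, n) on `edge`.

    activation_bytes[k] is the size of the tensor entering layer k, so it is the
    payload sent over the link when the model is cut at k. min_local_layers is an
    assumed privacy constraint that keeps the first layers on the local node.
    """
    n = len(layer_flops)
    if len(activation_bytes) != n:
        raise ValueError("need one activation size per layer")
    if not 0 <= min_local_layers <= n:
        raise ValueError("min_local_layers out of range")

    best_k, best_latency = min_local_layers, float("inf")
    for k in range(min_local_layers, n + 1):
        local_time = sum(layer_flops[:k]) / local.flops_per_s
        edge_time = sum(layer_flops[k:]) / edge.flops_per_s
        # No transfer when every layer stays local (k == n).
        transfer_time = activation_bytes[k] / bandwidth_bytes_per_s if k < n else 0.0
        latency = local_time + transfer_time + edge_time
        if latency < best_latency:
            best_k, best_latency = k, latency
    return best_k, best_latency


# Illustrative, made-up workload: four equal layers, shrinking activations.
LAYER_FLOPS = [4e9, 4e9, 4e9, 4e9]
ACTIVATION_BYTES = [8e6, 2e6, 2e6, 1e6]
DEVICE = Node("device", 1e10)
MEC = Node("mec-node", 4e10)

if __name__ == "__main__":
    for bandwidth in (1e8, 1e6):  # re-evaluate the split as the link degrades
        k, latency = best_split(
            LAYER_FLOPS, ACTIVATION_BYTES, DEVICE, MEC, bandwidth, min_local_layers=1
        )
        print(f"bandwidth={bandwidth:.0e} B/s -> split at layer {k}, latency {latency:.2f} s")
```

With these made-up numbers, the chosen split moves from layer 1 on the fast link to layer 4 (fully local) on the slow link, which mirrors in miniature the re-splitting behavior the abstract describes. A real orchestrator would search over many nodes and fold utilization and throughput into the objective; this sketch covers only the two-node, latency-only case.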
