Joint Partitioning and Placement of Foundation Models for Real-Time Edge AI
Aladin Djuhera, Fernando Koch, Alecio Binotto
TL;DR
This paper tackles real-time edge inference of large foundation models (LFMs) by jointly partitioning and placing them at runtime across heterogeneous edge resources. It introduces an adaptive orchestration framework that profiles resource capacity, re-partitions the LFM graph at runtime, and enforces privacy constraints through selective local execution. The approach is formalized as a constrained optimization over partitions and placements and realized as a modular architecture with four components: monitoring, decision-making, graph re-splitting, and reconfiguration broadcast. It is demonstrated in a 6G multi-access edge computing (MEC) scenario, showing substantial latency and utilization improvements with modest overhead. The framework is designed to integrate with existing orchestration stacks and to extend to future AI-native scheduling goals.
Abstract
Inference over large-scale foundation models in heterogeneous edge environments requires an orchestration substrate that can be reconfigured at runtime. Static partitioning of model layers presumes that compute and network resources remain stable over time, an assumption that does not hold in volatile real-world deployments. We introduce a framework in which both the spatial placement and the internal segmentation of foundation models are resolved at runtime. The orchestration problem is formalized as a constrained optimization over layer-wise assignments, subject to time-varying latency, utilization, and privacy constraints. By integrating model-aware capacity profiling with dynamic graph re-partitioning and reallocation, the framework adapts the inference pipeline to infrastructure fluctuations. We present the architectural and algorithmic components, along with a representative use case in 6G multi-access edge computing.
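To make the idea of a constrained optimization over layer-wise assignments concrete, the following is a minimal illustrative sketch, not the authors' formulation or solver. It assumes a linear chain of layers, a single uniform link bandwidth between nodes, a per-node compute budget as a stand-in for the utilization constraint, and a boolean "private" flag that forces a layer onto a trusted (local) node. All names (Layer, Node, best_assignment, the numeric values) are hypothetical, and the exhaustive search is only practical for toy instances.

```python
from dataclasses import dataclass
from itertools import product
from typing import Optional, Sequence, Tuple


@dataclass(frozen=True)
class Layer:
    name: str
    gflop: float           # compute cost of the layer (assumed known from profiling)
    out_mb: float          # size of the activation passed to the next layer
    private: bool = False  # True: must execute on a trusted node


@dataclass(frozen=True)
class Node:
    name: str
    gflop_per_s: float     # sustained throughput
    budget_gflop: float    # compute budget currently available on this node
    trusted: bool          # True for the local / on-premises device


def latency(assign: Sequence[int], layers: Sequence[Layer],
            nodes: Sequence[Node], link_mb_per_s: float) -> float:
    """End-to-end latency: per-layer compute plus a transfer at every cut."""
    total = 0.0
    for i, (layer, n) in enumerate(zip(layers, assign)):
        total += layer.gflop / nodes[n].gflop_per_s
        if i + 1 < len(layers) and assign[i + 1] != n:
            total += layer.out_mb / link_mb_per_s
    return total


def feasible(assign: Sequence[int], layers: Sequence[Layer],
             nodes: Sequence[Node]) -> bool:
    """Check the privacy constraint and the per-node compute budget."""
    load = [0.0] * len(nodes)
    for layer, n in zip(layers, assign):
        if layer.private and not nodes[n].trusted:
            return False
        load[n] += layer.gflop
    return all(used <= node.budget_gflop for used, node in zip(load, nodes))


def best_assignment(layers: Sequence[Layer], nodes: Sequence[Node],
                    link_mb_per_s: float) -> Optional[Tuple[Tuple[int, ...], float]]:
    """Exhaustively search layer-to-node assignments; None if none is feasible."""
    best = None
    for assign in product(range(len(nodes)), repeat=len(layers)):
        if not feasible(assign, layers, nodes):
            continue
        cost = latency(assign, layers, nodes, link_mb_per_s)
        if best is None or cost < best[1]:
            best = (assign, cost)
    return best


# Toy instance with made-up numbers.
layers = [
    Layer("embed", gflop=1.0, out_mb=0.5, private=True),
    Layer("block1", gflop=4.0, out_mb=2.0),
    Layer("block2", gflop=4.0, out_mb=2.0),
    Layer("head", gflop=1.0, out_mb=0.1),
]
nodes = [
    Node("device", gflop_per_s=2.0, budget_gflop=2.0, trusted=True),
    Node("mec", gflop_per_s=20.0, budget_gflop=100.0, trusted=False),
]

result = best_assignment(layers, nodes, link_mb_per_s=100.0)
if result is not None:
    assign, cost = result
    print([nodes[n].name for n in assign], round(cost, 3))
```

In a runtime setting, a monitoring component would refresh the node budgets and link bandwidth, and the search would be re-run whenever those measurements change. A realistic system would replace the brute-force loop with a scalable solver or heuristic, which this sketch does not attempt.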
