Joint Partitioning and Placement of Foundation Models for Real-Time Edge AI

Aladin Djuhera, Fernando Koch, Alecio Binotto

TL;DR

This paper tackles real-time edge inference of large foundation models (LFMs) by enabling joint, runtime partitioning and placement across heterogeneous edge resources. It introduces an adaptive orchestration framework that profiles capacity, re-partitions the LFM graph at runtime, and enforces privacy constraints through selective local execution. The approach is formalized as a constrained optimization over partitions and placements, with a modular architecture (monitoring, decision-making, graph re-splitting, and reconfiguration broadcast). It is demonstrated in a 6G/MEC scenario, where it reports lower latency and better node utilization at modest overhead. The framework is designed to integrate with existing orchestration stacks and to extend to future AI-native scheduling goals.

Abstract

Inference over large-scale foundation models within heterogeneous edge environments necessitates a fundamentally reconfigurable orchestration substrate. Static partitioning of model layers presumes temporal stability across compute and network resources, which is misaligned with the volatility of real-world deployments. We introduce a framework in which both the spatial placement and internal segmentation of foundation models are elevated to runtime-resolved constructs. The orchestration problem is formalized as a constrained optimization over layer-wise assignments, subject to evolving latency, utilization, and privacy gradients. The framework implements reactive inference composition responsive to infrastructural fluctuations by integrating model-aware capacity profiling with dynamic graph re-partitioning and reallocation. We introduce architectural and algorithmic components, along with a representative use case in 6G multi-access edge computing.
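
To make the idea of a constrained optimization over layer-wise assignments concrete, the following is a minimal sketch and not the paper's formulation or algorithm. It assumes a purely sequential model, a single shared link bandwidth, a latency-only objective, and a privacy rule that private layers may only run on trusted nodes. All names (`Node`, `Layer`, `latency`, `best_assignment`) and all numbers are hypothetical, and the exhaustive search is only practical for toy sizes.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Node:
    name: str
    flops_per_s: float  # compute currently available on this node (assumed profile)
    trusted: bool       # whether privacy-sensitive layers may run here


@dataclass(frozen=True)
class Layer:
    flops: float      # compute cost of the layer
    out_bytes: float  # size of the activation handed to the next layer
    private: bool     # must stay on a trusted node


def latency(layers, nodes, assign, bandwidth):
    """Compute time plus transfer time for assign[i] = node index of layer i."""
    total = 0.0
    for i, layer in enumerate(layers):
        total += layer.flops / nodes[assign[i]].flops_per_s
        # Pay a transfer cost only where consecutive layers sit on different nodes.
        if i + 1 < len(layers) and assign[i] != assign[i + 1]:
            total += layer.out_bytes / bandwidth
    return total


def best_assignment(layers, nodes, bandwidth):
    """Exhaustively search all layer-to-node assignments (toy sizes only)."""
    best, best_lat = None, float("inf")
    for assign in product(range(len(nodes)), repeat=len(layers)):
        # Privacy constraint: skip assignments that put a private layer
        # on an untrusted node.
        if any(l.private and not nodes[a].trusted for l, a in zip(layers, assign)):
            continue
        lat = latency(layers, nodes, assign, bandwidth)
        if lat < best_lat:
            best, best_lat = assign, lat
    return best, best_lat


if __name__ == "__main__":
    nodes = [Node("device", 1e9, True), Node("edge", 1e10, False)]
    layers = [
        Layer(1e9, 1e6, True),   # privacy-sensitive input layer
        Layer(4e9, 1e5, False),
        Layer(4e9, 1e3, False),
    ]
    # Re-solving under a new bandwidth profile mimics a runtime re-split.
    for bandwidth in (1e7, 1e5):  # bytes per second
        print(bandwidth, best_assignment(layers, nodes, bandwidth))
```

With these toy numbers, the preferred split moves from after the first layer (assignment `(0, 1, 1)`) at high bandwidth to after the second layer (assignment `(0, 0, 1)`) at low bandwidth. This is the kind of shift that a static partition cannot follow. A realistic orchestrator would replace the brute-force search with a scalable solver and would also account for utilization and reconfiguration cost.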

Paper Structure

This paper contains 14 sections, 4 equations, 3 figures, and 2 tables.

Figures (3)

  • Figure 1: DSI with LFMs requires adaptive orchestration between edge and cloud nodes to guarantee latency, service quality, and efficient node utilization.
  • Figure 2: Reference architecture of the proposed adaptive split inference orchestration. Sub-split models (S1, S2, S3) are deployed across edge/cloud nodes, while a central orchestrator, guided by real-time capacity profiling, re-splits and reconfigures workloads on demand to meet QoS and privacy constraints.
  • Figure 3: Expected latency vs. bandwidth availability in our 5G MEC urban scenario. Adaptive split inference orchestration results in lower total latency.