Multi-Agentic AI for Fairness-Aware and Accelerated Multi-modal Large Model Inference in Real-world Mobile Edge Networks

Haiyuan Li; Hari Madhukumar; Shuangyi Yan; Yulei Wu; Dimitra Simeonidou

TL;DR

This paper tackles latency and fairness in real-world mobile-edge inference for multi-modal large models. It proposes a hierarchical, two-tier Agentic AI framework with a long-term Global Planning Agent and short-term Prompt Scheduling and LM Deployment Control Agents to coordinate cross-node routing and resource allocation, using natural-language reasoning over runtime telemetry and episodic memory. The authors validate the approach on a Bristol-based city-scale testbed, demonstrating over $80\%$ reduction in end-to-end latency and a normalized Jain fairness index of approximately $0.90$, with strong generalization without task-specific fine-tuning. The work combines a rigorous problem formulation with a practical deployment and evaluation, showing the viability and scalability of LLM-driven infrastructure orchestration for GenAI services at the edge, and points to future directions in federated edge learning and energy-aware optimization.

Abstract

Generative AI (GenAI) has transformed applications in natural language processing and content creation, yet centralized inference remains hindered by high latency, limited customizability, and privacy concerns. Deploying large models (LMs) in mobile edge networks emerges as a promising solution. However, it also poses new challenges, including heterogeneous multi-modal LMs with diverse resource demands and inference speeds, varied prompt/output modalities that complicate orchestration, and resource-limited infrastructure ill-suited for concurrent LM execution. In response, we propose a Multi-Agentic AI framework for latency- and fairness-aware multi-modal LM inference in mobile edge networks. Our solution includes a long-term planning agent, a short-term prompt scheduling agent, and multiple on-node LM deployment agents, all powered by foundation language models. These agents cooperatively optimize prompt routing and LM deployment through natural language reasoning over runtime telemetry and historical experience. To evaluate its performance, we further develop a city-wide testbed that supports network monitoring, containerized LM deployment, intra-server resource management, and inter-server communications. Experiments demonstrate that our solution reduces average latency by over 80% and improves fairness (Normalized Jain index) to 0.90 compared to other baselines. Moreover, our solution adapts quickly without fine-tuning, offering a generalizable solution for optimizing GenAI services in edge environments.
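The fairness figure quoted above is a normalized Jain index. As a minimal sketch, the standard Jain index over n non-negative values is (sum x)^2 / (n * sum x^2), which ranges from 1/n to 1. One common normalization rescales that range to [0, 1]. The abstract does not spell out the exact normalization or which per-service quantity is measured, so the rescaling and the function names below are assumptions for illustration, not the authors' definition.

```python
from typing import Sequence


def jain_index(values: Sequence[float]) -> float:
    """Standard Jain fairness index: (sum x)^2 / (n * sum x^2), in [1/n, 1]."""
    n = len(values)
    if n == 0:
        raise ValueError("values must be non-empty")
    total = sum(values)
    squares = sum(v * v for v in values)
    if squares == 0:
        # All-zero input: every entry is equal, so treat it as perfectly fair.
        return 1.0
    return (total * total) / (n * squares)


def normalized_jain_index(values: Sequence[float]) -> float:
    """Rescale the Jain index from [1/n, 1] to [0, 1].

    Assumption: min-max rescaling; the paper may define it differently.
    """
    n = len(values)
    raw = jain_index(values)
    if n == 1:
        return 1.0  # a single entry is trivially fair
    return (raw - 1.0 / n) / (1.0 - 1.0 / n)


if __name__ == "__main__":
    # Hypothetical per-service values, not data from the paper.
    print(normalized_jain_index([1.0, 1.0, 1.0, 1.0]))  # equal shares
    print(normalized_jain_index([1.0, 0.0, 0.0, 0.0]))  # one service gets everything
```

Under this rescaling, equal values score 1 and a fully concentrated allocation scores 0, so a reported value near 0.90 would sit close to the even end of the scale.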

Paper Structure

This paper contains 20 sections, 17 equations, 7 figures, and 2 tables.

Introduction
Literature Review
Model-centric approaches
Resource-centric approaches
System Model and Problem Formulation
Methodology: Hierarchical Decision Framework for Joint Prompt Scheduling and Resource Allocation
Tier-1: Long-term Global Planning Agent
Tier-2: Short-term Prompt Scheduling Agent and LM Deployment Control Agent
Prompt Scheduling Agent
LM Deployment Control Agent
Experiment Setup and Implementation
Network architecture
Containerized LM deployment
Model parallelism on GPU
Network monitoring & management
...and 5 more sections

Figures (7)

Figure 1: A mobile edge network example where MEC servers host LMs of different modalities and collaboratively handle diverse inference requests.
Figure 2: The architecture of the proposed Multi-Agentic AI solution for resource management of Edge LM inference.
Figure 3: Architecture of testbed and the allocation of network elements among servers.
Figure 4: Sequence diagram of the runtime operation. Communication between MEC servers is handled via FastAPI.
Figure 5: Comparison of different scheduling and deployment strategies in terms of end-to-end latency (x-axis) and normalized Jain’s fairness index (y-axis).
...and 2 more figures
