Deploying Foundation Model Powered Agent Services: A Survey
Wenchao Xu, Jinyu Chen, Peirong Zheng, Xiaoquan Yi, Tianyi Tian, Wenhui Zhu, Quan Wan, Haozhao Wang, Yunfeng Fan, Qinliang Su, Xuemin Shen
TL;DR
This survey presents a unified, layer-wise framework for deploying foundation model–powered agent services across edge-cloud ecosystems. It systematically analyzes low-level execution optimization (computation, memory, communication) and high-level strategies (resource allocation, parallelism, model and token adaptation) while detailing FM families (LLMs, instruction-tuned, multimodal, MoE, tiny) and compression methods (pruning, quantization, distillation). It also covers AI agents with multi-agent frameworks, planning, memory, and tool use, and discusses batching and practical applications to show how to achieve real-time, QoS-aware deployments. The work identifies critical lessons from heterogeneous hardware, scalability challenges, and elastic serving gaps, and outlines directions for efficiently deploying large FMs, supporting multimodal and MoE models at the edge-cloud boundary, and building agent-specific serving systems. Overall, the framework and synthesis aim to accelerate real-world adoption of FM-powered agents by integrating hardware-aware optimizations with software-level FM and agent design choices. The insights have practical implications for designers seeking scalable, low-latency, and cost-effective FM-powered services.
Abstract
Foundation model (FM) powered agent services are regarded as a promising solution for developing intelligent and personalized applications on the path toward Artificial General Intelligence (AGI). To achieve high reliability and scalability in deploying these agent services, it is essential to jointly optimize computational and communication resources, thereby ensuring effective resource allocation and seamless service delivery. In pursuit of this vision, this paper proposes a unified framework and provides a comprehensive survey on deploying FM-based agent services across heterogeneous devices, with an emphasis on integrating model and resource optimization to establish a robust infrastructure for these services. In particular, the paper begins by exploring various low-level optimization strategies during inference and studies approaches that enhance system scalability, such as parallelism techniques and resource scaling methods. The paper then discusses several prominent FMs and investigates research efforts focused on inference acceleration, including techniques such as model compression and token reduction. The paper also investigates critical components for constructing agent services and highlights notable intelligent applications. Finally, the paper presents potential research directions for developing real-time agent services with high Quality of Service (QoS).
