Deploying Foundation Model Powered Agent Services: A Survey

Wenchao Xu, Jinyu Chen, Peirong Zheng, Xiaoquan Yi, Tianyi Tian, Wenhui Zhu, Quan Wan, Haozhao Wang, Yunfeng Fan, Qinliang Su, Xuemin Shen

TL;DR

This survey presents a unified, layer-wise framework for deploying foundation model–powered agent services across edge-cloud ecosystems. It systematically analyzes low-level execution optimization (computation, memory, communication) and high-level strategies (resource allocation, parallelism, model and token adaptation) while detailing FM families (LLMs, instruction-tuned, multimodal, MoE, tiny) and compression methods (pruning, quantization, distillation). It also covers AI agents with multi-agent frameworks, planning, memory, and tool use, and discusses batching and practical applications to show how to achieve real-time, QoS-aware deployments. The work identifies critical lessons from heterogeneous hardware, scalability challenges, and elastic serving gaps, and outlines directions for efficiently deploying large FMs, supporting multimodal and MoE models at the edge-cloud boundary, and building agent-specific serving systems. Overall, the framework and synthesis aim to accelerate real-world adoption of FM-powered agents by integrating hardware-aware optimizations with software-level FM and agent design choices. The insights have practical implications for designers seeking scalable, low-latency, and cost-effective FM-powered services.

Abstract

Foundation model (FM) powered agent services are regarded as a promising way to build intelligent, personalized applications on the path toward Artificial General Intelligence (AGI). Deploying these agent services with high reliability and scalability requires jointly optimizing computational and communication resources so that resources are allocated effectively and services are delivered seamlessly. To this end, this paper proposes a unified framework and provides a comprehensive survey of deploying FM-based agent services across heterogeneous devices, with an emphasis on integrating model and resource optimization to establish a robust infrastructure for these services. In particular, the paper begins by exploring low-level optimization strategies for inference and studies approaches that enhance system scalability, such as parallelism techniques and resource scaling methods. It then discusses several prominent FMs and investigates research on inference acceleration, including model compression and token reduction. The paper also examines the critical components for constructing agent services and highlights notable intelligent applications. Finally, it presents potential research directions for developing real-time agent services with high Quality of Service (QoS).

Paper Structure

This paper contains 51 sections, 14 figures, 13 tables.

Figures (14)

  • Figure 1: The framework of FM-powered agent services. The execution layer runs model inference with low-level optimizations. The resource layer focuses on designing strategies for parallelism and resource scaling. The model and agent layers work on optimizing FMs and various agent components. The application layer constructs different intelligent applications.
  • Figure 2: The framework of our survey. Each technical section corresponds to a layer in Figure 1.
  • Figure 3: A multi-layer optimization framework for edge computing systems serving FMs. At the hardware level, heterogeneous resources such as FPGAs, ASICs, IMCs, CPUs, and GPUs are utilized and optimized in terms of computation and memory. The integrated frameworks can support heterogeneous hardware. The network level focuses on semantic communication.
  • Figure 4: The illustration of resource allocation. Resource allocation in a serving framework primarily involves dynamically adjusting the resource allocation strategy based on real-time resource conditions and query load.
  • Figure 5: The illustration of different parallelism methods. Data parallelism divides the data into multiple micro-batches for processing. Model parallelism partitions a model into several modules (stages). Tensor parallelism splits a tensor into various segments.
  • ...and 9 more figures
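
The Figure 5 caption contrasts splitting the data with splitting a tensor. The toy sketch below illustrates that difference for a single matrix multiplication. It is not taken from the survey: the function names (`matmul`, `data_parallel`, `tensor_parallel`) and the choice to split the weight matrix by columns are assumptions made for illustration, and the "workers" are simulated sequentially in plain Python rather than run on separate devices. Model (stage) parallelism is not shown.

```python
# Toy illustration of data vs. tensor parallelism using plain Python lists.
# Workers are simulated by list entries; nothing here actually runs in parallel.

def matmul(x, w):
    """Multiply matrix x (n x k) by matrix w (k x m)."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)] for row in x]

def data_parallel(x, w, num_workers):
    """Split the batch (rows of x) into micro-batches; every worker holds all of w."""
    size = -(-len(x) // num_workers)  # ceiling division
    micro_batches = [x[i:i + size] for i in range(0, len(x), size)]
    outputs = [matmul(mb, w) for mb in micro_batches]  # one call per simulated worker
    return [row for out in outputs for row in out]     # concatenate along the batch axis

def tensor_parallel(x, w, num_workers):
    """Split the columns of w into segments; every worker sees the full batch."""
    cols = list(zip(*w))
    size = -(-len(cols) // num_workers)  # ceiling division
    shards = [cols[i:i + size] for i in range(0, len(cols), size)]
    # Turn each group of columns back into a (k x m_shard) matrix.
    shard_mats = [[list(r) for r in zip(*shard)] for shard in shards]
    outputs = [matmul(x, sm) for sm in shard_mats]     # one call per simulated worker
    # Concatenate the partial results along the feature axis.
    return [sum((out[i] for out in outputs), []) for i in range(len(x))]

x = [[1, 2], [3, 4], [5, 6], [7, 8]]  # batch of 4 inputs, 2 features each
w = [[1, 0, 2, 1], [0, 1, 1, 3]]      # weight matrix, 2 x 4
full = matmul(x, w)
```

In this sketch both strategies reproduce the single-worker result `full`; they differ only in what is partitioned (the inputs versus the weights) and therefore in what each worker must store and what must be gathered afterwards.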