Smaller, Smarter, Closer: The Edge of Collaborative Generative AI

Roberto Morabito, SiYoung Jang

TL;DR

The paper addresses the latency, cost, and privacy limitations of cloud-centric GenAI by proposing an edge-centric collaborative inference framework that leverages Small Language Models (SLMs) across a computing continuum. It introduces a three-way cooperation model—Data, Computation, and Knowledge—and a practical architecture featuring a decentralized Capability Metadata Store (CMS), semantic discovery, a Task Orchestrator, and a Classifier Engine to enable scalable, distributed inference. Through application scenarios in mobile healthcare and urban intelligence, the work demonstrates how edge devices can collaboratively process multi-modal data while querying domain knowledge as needed, and it analyzes scheduling strategies that significantly reduce cloud usage. The approach offers actionable guidance for deploying GenAI with improved latency, privacy, and resilience, outlining future research directions in dynamic task delegation, interoperability among heterogeneous SLMs, and incentive-based resource sharing across the edge-cloud boundary.
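As a rough illustration of how a metadata store, discovery step, and orchestrator could fit together, the sketch below registers agents in an in-memory list, matches a task against their descriptions, and prefers edge agents over the cloud. It is not the paper's implementation: `CapabilityEntry`, `overlap_score`, `discover`, `route`, the field names, and the sample entries are all hypothetical, and plain word overlap stands in for semantic discovery.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CapabilityEntry:
    """One hypothetical CMS record; the field names are assumptions."""
    agent_id: str
    tier: str         # "edge" or "cloud"
    cooperation: str  # "data", "computation", or "knowledge"
    description: str


def overlap_score(query: str, description: str) -> int:
    """Count shared lowercase words: a crude stand-in for semantic matching."""
    return len(set(query.lower().split()) & set(description.lower().split()))


def discover(store, query, cooperation):
    """Return matching entries, edge agents first, then by descending overlap."""
    candidates = [
        entry for entry in store
        if entry.cooperation == cooperation
        and overlap_score(query, entry.description) > 0
    ]
    return sorted(
        candidates,
        key=lambda e: (e.tier != "edge", -overlap_score(query, e.description)),
    )


def route(store, query, cooperation):
    """Pick the best-ranked agent id, or None when nothing matches."""
    matches = discover(store, query, cooperation)
    return matches[0].agent_id if matches else None


# Illustrative entries only; none of these come from the paper.
STORE = [
    CapabilityEntry("wearable-01", "edge", "data", "heart rate sensor stream"),
    CapabilityEntry("phone-slm", "edge", "computation", "summarize short clinical text"),
    CapabilityEntry("clinic-kb", "edge", "knowledge", "cardiology guidelines lookup"),
    CapabilityEntry("cloud-llm", "cloud", "computation", "summarize long clinical text reports"),
]

if __name__ == "__main__":
    print(route(STORE, "summarize clinical text", "computation"))
    print(route(STORE, "long reports", "computation"))
```

Sorting on the tier before the overlap score encodes an edge-first policy: the cloud agent is chosen only when no edge agent matches the task, which is one simple way to reduce cloud usage.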

Abstract

The rapid adoption of generative AI (GenAI), particularly Large Language Models (LLMs), has exposed critical limitations of cloud-centric deployments, including latency, cost, and privacy concerns. Meanwhile, Small Language Models (SLMs) are emerging as viable alternatives for resource-constrained edge environments, though they often lack the capabilities of their larger counterparts. This article explores the potential of collaborative inference systems that leverage both edge and cloud resources to address these challenges. By presenting distinct cooperation strategies alongside practical design principles and experimental insights, we offer actionable guidance for deploying GenAI across the computing continuum.

Paper Structure

This paper contains 15 sections, 6 figures.

Figures (6)

  • Figure 1: Interaction between edge-based and cloud-based SLM-enabled agents.
  • Figure 2: Smaller, Smarter, Closer: (1) Progression towards smaller language models with comparable or improved efficiency (top); (2) Improved accuracy of smaller models as measured by Pass@1 scores (middle); and (3) Infrastructure-related latency comparisons across global locations for major AI systems, highlighting geographical disparities in access speed (bottom).
  • Figure 3: Task-oriented cooperation types: Data, Computation, and Knowledge cooperation strategies for multi-agent collaboration. The Capability Metadata Store (CMS), introduced later in the paper, is depicted here as a key enabler for coordination and metadata-driven interactions among agents.
  • Figure 4: On top, our conceptual framework for enabling collaboration among distributed SLM-enabled agents. On the bottom, the table illustrates examples of CMS entries for coordinating data, computation, and knowledge cooperation across agents.
  • Figure 5: A collaborative language model inference scenario illustrating (a) sequential collaboration and (b) parallel collaboration.
  • ...and 1 more figure