Smaller, Smarter, Closer: The Edge of Collaborative Generative AI
Roberto Morabito, SiYoung Jang
TL;DR
The paper addresses the latency, cost, and privacy limitations of cloud-centric GenAI by proposing an edge-centric collaborative inference framework that leverages Small Language Models (SLMs) across the computing continuum. It introduces a three-way cooperation model (Data, Computation, and Knowledge) and a practical architecture featuring a decentralized Capability Metadata Store (CMS), semantic discovery, a Task Orchestrator, and a Classifier Engine to enable scalable, distributed inference. Through application scenarios in mobile healthcare and urban intelligence, the work shows how edge devices can collaboratively process multi-modal data while querying domain knowledge as needed, and it analyzes scheduling strategies that significantly reduce cloud usage. The approach offers actionable guidance for deploying GenAI with improved latency, privacy, and resilience, and it outlines future research directions in dynamic task delegation, interoperability among heterogeneous SLMs, and incentive-based resource sharing across the edge-cloud boundary.
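To make the roles of the named components concrete, the following is a minimal sketch, not the authors' implementation. It assumes that the CMS holds free-text capability descriptions per node, that semantic discovery can be approximated by word overlap (a stand-in for embedding similarity), and that the Classifier Engine is a simple length heuristic. All class names, function names, node identifiers, and thresholds are hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Capability:
    """One hypothetical CMS entry: a node and a description of what its SLM can do."""
    node_id: str
    description: str


def similarity(a: str, b: str) -> float:
    """Jaccard overlap of word sets; a crude stand-in for semantic matching."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


@dataclass
class CapabilityMetadataStore:
    """In-memory stand-in for the decentralized CMS described in the text."""
    entries: list[Capability] = field(default_factory=list)

    def register(self, cap: Capability) -> None:
        self.entries.append(cap)

    def discover(self, query: str) -> tuple[Capability | None, float]:
        """Return the best-matching entry and its score, or (None, 0.0)."""
        best, best_score = None, 0.0
        for cap in self.entries:
            score = similarity(query, cap.description)
            if score > best_score:
                best, best_score = cap, score
        return best, best_score


def too_heavy_for_edge(task: str, max_words: int = 12) -> bool:
    """Toy classifier (assumption): long prompts are treated as cloud-only."""
    return len(task.split()) > max_words


def route(task: str, cms: CapabilityMetadataStore, threshold: float = 0.2) -> str:
    """Toy orchestrator: prefer a matching edge node, fall back to the cloud."""
    if too_heavy_for_edge(task):
        return "cloud"
    cap, score = cms.discover(task)
    if cap is not None and score >= threshold:
        return cap.node_id
    return "cloud"


cms = CapabilityMetadataStore()
cms.register(Capability("wearable-1", "heart rate ecg signal summarization"))
cms.register(Capability("camera-7", "traffic image captioning street scene"))

print(route("summarize ecg signal", cms))      # wearable-1
print(route("translate legal contract", cms))  # cloud
```

In this sketch the cloud is only a fallback: a request leaves the edge when no registered node matches well enough or when the classifier judges the task too large, which mirrors the goal of reducing cloud usage at a very coarse level.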
Abstract
The rapid adoption of generative AI (GenAI), particularly Large Language Models (LLMs), has exposed critical limitations of cloud-centric deployments, including latency, cost, and privacy concerns. Meanwhile, Small Language Models (SLMs) are emerging as viable alternatives for resource-constrained edge environments, though they often lack the capabilities of their larger counterparts. This article explores the potential of collaborative inference systems that leverage both edge and cloud resources to address these challenges. By presenting distinct cooperation strategies alongside practical design principles and experimental insights, we offer actionable guidance for deploying GenAI across the computing continuum.
