CE-LSLM: Efficient Large-Small Language Model Inference and Communication via Cloud-Edge Collaboration

Pengyan Zhu, Tingting Yang

TL;DR

The paper tackles latency and privacy challenges in edge inference for 6G by proposing CE-LSLM, a cloud–edge collaborative framework that assigns long-context processing to a cloud LLM and lightweight local decoding to edge SLMs. It introduces dynamic KV cache sharing between cloud and edge and among peer edge nodes, layer alignment based on layer similarity metrics, and dimensionality reduction for attention caches, all coordinated by a pipelined compute–load scheduling approach. Key contributions include the hierarchical cache sharing framework, dynamic cache management, layer-wise KV sharing strategies, and a pipeline-based latency reduction mechanism, which are reported to improve latency and throughput while preserving data privacy. These methods are intended to enable scalable, private, multi-tenant edge deployments for 6G intelligent services, with robustness to network disruptions and concurrent workloads. The core latency objective guiding the design and evaluation is $T_{total}=T_{com_C}+T_{comm}+T_{com_E}$.
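
As a rough illustration of the latency objective in the TL;DR, the following Python sketch adds up the three terms. Reading $T_{com_C}$, $T_{comm}$, and $T_{com_E}$ as cloud computation, cloud-to-edge communication, and edge computation time is an assumption inferred from the subscripts. The names `LatencyBreakdown` and `transfer_time`, the bandwidth-based transfer model, and all numbers are hypothetical and not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class LatencyBreakdown:
    """Hypothetical container for the three terms of T_total (seconds)."""
    t_com_c: float  # assumed meaning: cloud-side computation time
    t_comm: float   # assumed meaning: cloud-to-edge communication time
    t_com_e: float  # assumed meaning: edge-side computation time

    def total(self) -> float:
        # T_total = T_com_C + T_comm + T_com_E
        return self.t_com_c + self.t_comm + self.t_com_e


def transfer_time(kv_bytes: float, bandwidth_bytes_per_s: float,
                  rtt_s: float = 0.0) -> float:
    """Simple illustrative model: round-trip delay plus payload / bandwidth."""
    if bandwidth_bytes_per_s <= 0:
        raise ValueError("bandwidth must be positive")
    return rtt_s + kv_bytes / bandwidth_bytes_per_s


# Made-up example: a 50 MB KV cache sent over a 100 MB/s link.
example = LatencyBreakdown(
    t_com_c=0.25,
    t_comm=transfer_time(kv_bytes=50e6, bandwidth_bytes_per_s=100e6),
    t_com_e=0.25,
)
print(f"T_total = {example.total():.2f} s")
```

Under this simple model, shrinking the transferred KV cache (for example through dimensionality reduction) lowers only the communication term, while overlapping loading with decoding attacks the sum itself.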

Abstract

Emerging intelligent service scenarios in 6G communication impose stringent requirements for low latency, high reliability, and privacy preservation. Generative large language models (LLMs) are gradually becoming key enablers for the integration of semantic communication and computation. However, due to the limited computational resources of edge devices and the increasing complexity of heterogeneous terminal access, existing centralized inference approaches fail to meet the dual demands of response efficiency and data privacy in edge-side inference tasks. To address these challenges, this paper proposes a novel collaborative inference architecture that integrates cloud-based LLMs with edge-deployed small language models (SLMs), enabling dynamic scheduling and sharing of semantic-level intermediate states, and establishing a unified computation-communication paradigm tailored for 6G networks. Specifically, a key-value (KV) cache reuse mechanism is introduced to enhance the semantic understanding of edge models through contextual guidance from the cloud, while significantly reducing edge-side computational and storage overhead. Furthermore, a cross-node parallel scheduling mechanism is proposed to achieve asynchronous coordination between model state loading and decoding computation, thereby improving edge responsiveness. In addition, we investigate layer alignment and representation compression strategies between heterogeneous models to alleviate the communication burden on the edge. Experimental results demonstrate that the proposed architecture exhibits superior adaptability and scalability in terms of inference latency, system stability, and concurrent processing capacity.
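
The abstract describes asynchronous coordination between model state loading and decoding computation. The sketch below is a simplified timing model of that idea, not the paper's scheduler: it assumes KV blocks are loaded one layer at a time over a single channel, and that computation for a layer can start only once that layer's block has arrived and the previous layer has finished. The function names and per-layer timings are hypothetical.

```python
from typing import Sequence


def sequential_latency(load_s: Sequence[float], compute_s: Sequence[float]) -> float:
    """Load every layer's KV block first, then compute: no overlap at all."""
    if len(load_s) != len(compute_s):
        raise ValueError("need one load time and one compute time per layer")
    return sum(load_s) + sum(compute_s)


def pipelined_latency(load_s: Sequence[float], compute_s: Sequence[float]) -> float:
    """Overlap loading of later layers with computation on earlier ones."""
    if len(load_s) != len(compute_s):
        raise ValueError("need one load time and one compute time per layer")
    load_done = 0.0     # time at which the current layer's KV block has arrived
    compute_done = 0.0  # time at which the current layer's computation ends
    for load, compute in zip(load_s, compute_s):
        load_done += load  # loads run back to back on one channel
        # Compute waits for both the previous layer and this layer's KV block.
        compute_done = max(compute_done, load_done) + compute
    return compute_done


# Made-up per-layer timings for a four-layer model (seconds).
loads = [1.0, 1.0, 1.0, 1.0]
computes = [2.0, 2.0, 2.0, 2.0]
print(sequential_latency(loads, computes), pipelined_latency(loads, computes))
```

In this model the pipelined latency approaches the larger of total load time and total compute time plus a short start-up or drain phase, which is why overlap helps most when the two are comparable.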

Paper Structure

This paper contains 22 sections, 33 equations, 7 figures, 2 tables.

Figures (7)

  • Figure 1: Collaborative Inference Workflow of Cloud LLM and Edge SLMs Across Heterogeneous Scenarios.
  • Figure 2: System architecture illustrating multi-level contextual KV cache scheduling and interaction between SLMs and LLM. The SLM first searches for shallow-layer contextual KV caches from peer or local edge storage to enable early-stage decoding. If a match is found, decoding starts while deeper-layer KV blocks—generated by the cloud LLM—are concurrently loaded from local disk or retrieved from cloud storage. This layer-aware, pipelined mechanism enables fast and resilient inference under dynamic connectivity conditions.
  • Figure 3: Schematic of Cache and Storage System Interaction Between LLM and SLM. From left to right, the diagram illustrates the edge device storage and caching module, the data forwarding layer, and the cloud storage and caching module. During the execution of complex inference tasks, the edge SLM requests and downloads the optimized contextual KV cache from the cloud-based LLM.
  • Figure 4: Illustration of Network-Resilient LLM Inference via Cache Reuse and Token Generation in Edge SLMs.
  • Figure 5: Heatmaps of structural similarity between edge and cloud model layers. (Left) Centered Kernel Alignment (CKA) similarity scores. (Right) Representational Similarity Analysis (RSA) scores. Brighter areas indicate higher similarity.
  • ...and 2 more figures
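
The caption of Figure 5 refers to Centered Kernel Alignment (CKA) scores between edge and cloud model layers. As a hedged illustration of how such a score can drive layer alignment, the sketch below implements the standard linear-kernel form of CKA in plain Python and maps each edge layer to its most similar cloud layer. Whether the paper uses the linear kernel, and how it turns scores into an alignment, is not stated in this summary; `linear_cka` and `align_layers` are hypothetical names, and the inputs are assumed to be activation matrices with one row per example.

```python
import math
from typing import List

Matrix = List[List[float]]  # rows = examples, columns = features


def center_columns(m: Matrix) -> Matrix:
    """Subtract each column's mean so every feature has zero mean."""
    n = len(m)
    means = [sum(col) / n for col in zip(*m)]
    return [[v - mu for v, mu in zip(row, means)] for row in m]


def cross_frobenius_sq(a: Matrix, b: Matrix) -> float:
    """Squared Frobenius norm of a^T b."""
    total = 0.0
    for col_a in zip(*a):
        for col_b in zip(*b):
            dot = sum(x * y for x, y in zip(col_a, col_b))
            total += dot * dot
    return total


def linear_cka(x: Matrix, y: Matrix) -> float:
    """Linear CKA: ||Xc^T Yc||_F^2 / (||Xc^T Xc||_F * ||Yc^T Yc||_F)."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same number of examples")
    xc, yc = center_columns(x), center_columns(y)
    denom = math.sqrt(cross_frobenius_sq(xc, xc) * cross_frobenius_sq(yc, yc))
    return cross_frobenius_sq(xc, yc) / denom if denom > 0 else 0.0


def align_layers(edge_acts: List[Matrix], cloud_acts: List[Matrix]) -> List[int]:
    """For each edge layer, return the index of the most similar cloud layer."""
    return [
        max(range(len(cloud_acts)), key=lambda j: linear_cka(e, cloud_acts[j]))
        for e in edge_acts
    ]
```

Linear CKA does not require the two layers to have the same width, which matters when comparing a small edge model against a larger cloud model. It is also unchanged by uniform rescaling or by shifting a layer's activations by a constant.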