CE-LSLM: Efficient Large-Small Language Model Inference and Communication via Cloud-Edge Collaboration
Pengyan Zhu, Tingting Yang
TL;DR
The paper tackles latency and privacy challenges in edge inference for 6G by proposing CE-LSLM, a cloud–edge collaborative framework that assigns long-context processing to a cloud LLM and lightweight local decoding to edge SLMs. It introduces dynamic KV cache sharing between cloud and edge and among peer edge nodes, layer alignment based on layer-similarity metrics, and dimensionality reduction for attention caches, all coordinated by a pipelined compute–load scheduling approach. Key contributions include the hierarchical cache sharing framework, dynamic cache management, layer-wise KV sharing strategies, and a pipeline-based latency reduction mechanism, which are reported to improve latency and throughput while preserving data privacy. These methods target scalable, private, multi-tenant edge deployments for 6G intelligent services, with robustness to network disruptions and concurrent workloads. The core latency objective guiding the design and evaluation is $T_{total}=T_{com_C}+T_{comm}+T_{com_E}$, the sum of the cloud computation, communication, and edge computation terms.
Abstract
Emerging intelligent service scenarios in 6G communication impose stringent requirements for low latency, high reliability, and privacy preservation. Generative large language models (LLMs) are gradually becoming key enablers for the integration of semantic communication and computation. However, due to the limited computational resources of edge devices and the increasing complexity of heterogeneous terminal access, existing centralized inference approaches fail to meet the dual demands of response efficiency and data privacy in edge-side inference tasks. To address these challenges, this paper proposes a novel collaborative inference architecture that integrates cloud-based LLMs with edge-deployed small language models (SLMs), enabling dynamic scheduling and sharing of semantic-level intermediate states, and establishing a unified computation-communication paradigm tailored for 6G networks. Specifically, a key-value (KV) cache reuse mechanism is introduced to enhance the semantic understanding of edge models through contextual guidance from the cloud, while significantly reducing edge-side computational and storage overhead. Furthermore, a cross-node parallel scheduling mechanism is proposed to achieve asynchronous coordination between model state loading and decoding computation, thereby improving edge responsiveness. In addition, we investigate layer alignment and representation compression strategies between heterogeneous models to alleviate the communication burden on the edge. Experimental results demonstrate that the proposed architecture exhibits superior adaptability and scalability in terms of inference latency, system stability, and concurrent processing capacity.
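To make the latency argument concrete, the following is a minimal illustrative sketch, not the authors' implementation. It evaluates the decomposition $T_{total}=T_{com_C}+T_{comm}+T_{com_E}$ quoted in the TL;DR, then shows why overlapping state loading with decoding computation can shorten the edge-side portion. The function names, the per-layer granularity, the single transfer channel, and all timing values are assumptions chosen for illustration.

```python
def sequential_latency(t_cloud, t_comm, t_edge):
    """T_total = T_com_C + T_comm + T_com_E, with no overlap between stages."""
    return t_cloud + t_comm + t_edge


def pipelined_latency(load_times, compute_times):
    """Finish time when cache loading overlaps with computation.

    Assumptions (hypothetical, for illustration only):
      - loads are issued back-to-back on a single transfer channel;
      - layer i can start computing only after its own load has finished
        and layer i-1 has finished computing.
    """
    load_done = 0.0      # time at which the current layer's cache has arrived
    compute_done = 0.0   # time at which the current layer has finished computing
    for load, comp in zip(load_times, compute_times):
        load_done += load
        compute_done = max(compute_done, load_done) + comp
    return compute_done


if __name__ == "__main__":
    loads = [2, 2, 2, 2]      # hypothetical per-layer load times (ms)
    computes = [3, 3, 3, 3]   # hypothetical per-layer compute times (ms)

    no_overlap = sum(loads) + sum(computes)
    overlapped = pipelined_latency(loads, computes)
    print(f"no overlap: {no_overlap} ms, pipelined: {overlapped} ms")
```

With these made-up numbers, only the first load is exposed; every later load is hidden behind the previous layer's computation. How much is actually saved in practice depends on the ratio of transfer time to compute time, which this sketch does not model beyond fixed per-layer costs.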
