Pushing Large Language Models to the 6G Edge: Vision, Challenges, and Opportunities

Zheng Lin; Guanqiao Qu; Qiyuan Chen; Xianhao Chen; Zhe Chen; Kaibin Huang

TL;DR

Addressing the latency, bandwidth, and privacy bottlenecks of cloud-only LLM deployment, this paper advocates pushing LLMs to the 6G mobile edge through end-edge cooperation. It presents a 6G MEC architecture comprising cloud-edge-user synergy, in-network model splitting, and parameter-sharing edge caching, paired with end-edge training and inference techniques. The authors review Split Parameter-efficient Fine-tuning (SplitPEFT), split learning, and various inference strategies (KV caching, MoE, SLM-LLM collaboration) to fit edge constraints. They also discuss open problems in green computing and privacy, and outline a roadmap for practical, privacy-preserving, low-latency LLM deployment at the edge.

Abstract

Large language models (LLMs), which have shown remarkable capabilities, are revolutionizing AI development and potentially shaping our future. However, given their multimodality, the status quo cloud-based deployment faces some critical challenges: 1) long response time; 2) high bandwidth costs; and 3) the violation of data privacy. 6G mobile edge computing (MEC) systems may resolve these pressing issues. In this article, we explore the potential of deploying LLMs at the 6G edge. We start by introducing killer applications powered by multimodal LLMs, including robotics and healthcare, to highlight the need for deploying LLMs in the vicinity of end users. Then, we identify the critical challenges for LLM deployment at the edge and envision the 6G MEC architecture for LLMs. Furthermore, we delve into two design aspects, i.e., edge training and edge inference for LLMs. In both aspects, considering the inherent resource limitations at the edge, we discuss various cutting-edge techniques, including split learning/inference, parameter-efficient fine-tuning, quantization, and parameter-sharing inference, to facilitate the efficient deployment of LLMs. This article serves as a position paper for thoroughly identifying the motivation, challenges, and pathway for empowering LLMs at the 6G edge.

Paper Structure (24 sections, 5 figures)

This paper contains 24 sections, 5 figures.

Introduction
Killer Applications: The Needs for Deployment at the Edge
Limitations for On-device LLM Deployment
6G MEC for Large Language Models: An Overview
Cloud-edge-user synergy
In-network model splitting
Parameter-sharing edge model caching and delivery
End-edge Large Model Fine-tuning
Split Parameter-efficient Fine-tuning (SplitPEFT)
Split learning
Integrating SL and parameter-efficient fine-tuning
Towards efficient split PEFT
Parallelism of Edge LLM Training
End-edge Large Model Inference
Split inference for LLMs
...and 9 more sections

Figures (5)

Figure 1: The MEC architecture for large language models in 6G.
Figure 2: An illustration of the transformer architecture and several state-of-the-art PEFT methods, including adapter tuning, prompt tuning, and low-rank adaptation.
Figure 3: Training latency and BLEU score of SplitLoRA [lin2024splitlora] versus the freezing ratio, where LoRA is employed to fine-tune GPT-2 medium on the WebText dataset. BLEU (bilingual evaluation understudy) is a metric for evaluating machine translations against human translations. An edge server and 20 clients are considered. The computing capabilities of the clients and the edge server are set to 3.56 and 35.6 TFLOPS, respectively (35.6 TFLOPS being the peak performance of one NVIDIA RTX 3090); the uplink and downlink rates are 70 Mbps and 300 Mbps; and the number of tokens used for training is 264M.
Figure 4: An illustration of multi-hop SL with pipeline parallelism. Multiple clients jointly train a large model based on SL approaches, such as SFL and PSL. The model is partitioned into multiple parts so that the total workload is shared among multiple edge servers.
Figure 5: The training latency of multi-hop SL versus the hop count, where LoRA is employed to fine-tune GPT-2 medium on the WebText dataset. The data samples are distributed over 5 clients, the transmission rate between edge servers is 400 Mbps, and other key parameters are consistent with Fig. 3.
