A Novel Hat-Shaped Device-Cloud Collaborative Inference Framework for Large Language Models

Zuan Xie, Yang Xu, Hongli Xu, Yunming Liao, Zhiwei Yao

TL;DR

This work tackles the latency and privacy limitations of cloud-only LLM services by introducing HAT, a device-cloud collaborative framework that combines U-shaped inference with speculative decoding. HAT keeps the input and output submodels on each end device, where they are stacked with a lightweight adapter network to serve as an on-device small language model (SLM) for drafting, and places the middle, compute-heavy submodel in the cloud. A prompt chunking mechanism reduces the delay of long prompts. On a physical testbed with 30 NVIDIA Jetson devices and a server with 8 NVIDIA A6000 GPUs, HAT reduces time-to-first-token (TTFT) by 41%–54% and time-between-tokens (TBT) by 41%–77% relative to the baselines, while exchanging hidden states rather than raw tokens between devices and the cloud. The results suggest practical potential for low-latency, privacy-preserving LLM services in device-cloud deployments.

Abstract

Recent advancements in large language models (LLMs) have catalyzed a substantial surge in demand for LLM services. While traditional cloud-based LLM services satisfy high-accuracy requirements, they fall short in meeting critical demands for low delay and enhanced privacy. To address these limitations, we propose HAT, a novel device-cloud collaborative inference framework that leverages the complementary strengths of U-shaped inference and speculative decoding. HAT partitions the LLM into three submodels. The input and output submodels, stacked with a lightweight adapter network, are deployed as a small language model (SLM) on each end device. The middle submodel, encompassing the majority of the LLM's decoder layers, is hosted in the cloud to perform speculative decoding with the on-device SLMs. During inference, HAT exchanges hidden states (rather than raw tokens) of input or draft tokens between devices and the cloud, which incurs substantial communication delays. Moreover, processing the hidden states of long prompts exacerbates computation delays in the cloud, further compromising inference efficiency. To improve efficiency, we introduce a prompt chunking mechanism that segments long prompts into shorter chunks, enabling parallel transmission and processing. Furthermore, HAT dynamically determines optimal chunk sizes for devices handling long prompts, thereby improving overall inference speed. Extensive experiments are conducted on a physical testbed comprising 30 NVIDIA Jetson devices and a server with 8 NVIDIA A6000 GPUs. Experimental results demonstrate that HAT reduces time-to-first-token (TTFT) by 41% to 54% and time-between-tokens (TBT) by 41% to 77% compared to the baselines.
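
As a rough illustration of why prompt chunking can shorten prefill, the following Python sketch models upload and cloud computation as a two-stage pipeline. It is not HAT's implementation: the function names (`split_prompt`, `pipelined_prefill_time`), the linear per-token cost model, and the `per_chunk_overhead` parameter are all assumptions made for this example, and the model ignores the extra attention cost that later chunks pay for attending to earlier ones.

```python
from typing import List, Sequence


def split_prompt(token_ids: List[int], chunk_size: int) -> List[List[int]]:
    """Split a prompt into consecutive chunks of at most chunk_size tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [token_ids[i:i + chunk_size]
            for i in range(0, len(token_ids), chunk_size)]


def pipelined_prefill_time(
    chunk_lens: Sequence[int],
    send_per_token: float,
    compute_per_token: float,
    per_chunk_overhead: float = 0.0,  # hypothetical fixed cost per chunk
) -> float:
    """Finish time of a two-stage pipeline (toy model, assumed linear costs).

    Stage 1 uploads the hidden states of each chunk in order.
    Stage 2 (cloud) processes a chunk once it has fully arrived and the
    previous chunk has been processed, so the upload of chunk k+1
    overlaps with the computation of chunk k.
    """
    send_done = 0.0
    compute_done = 0.0
    for n in chunk_lens:
        send_done += n * send_per_token
        start = max(send_done, compute_done)
        compute_done = start + n * compute_per_token + per_chunk_overhead
    return compute_done


if __name__ == "__main__":
    prompt = list(range(8))  # stand-in token ids
    for size in (8, 4, 2, 1):
        lens = [len(c) for c in split_prompt(prompt, size)]
        t = pipelined_prefill_time(lens, 1.0, 1.0, per_chunk_overhead=1.0)
        print(f"chunk_size={size}: finish time {t}")
```

Under these toy costs with no per-chunk overhead, an 8-token prompt sent as one piece finishes at 16 time units, while four 2-token chunks finish at 10, because each upload overlaps with the previous chunk's computation. A nonzero per-chunk overhead penalizes very small chunks, which hints at why a chunk size would need to be tuned rather than simply minimized.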

Paper Structure

This paper contains 23 sections, 6 equations, 12 figures, 5 tables.

Figures (12)

  • Figure 1: Results of Preliminary Experiments.
  • Figure 2: Illustration of HAT.
  • Figure 3: System overview of HAT.
  • Figure 4: Illustration of Prompt Chunking.
  • Figure 5: Illustration of Parallel Drafting.
  • ...and 7 more figures