
Synera: Synergistic LLM Serving across Device and Cloud at Scale

Genglin Wang, Liekang Zeng, Bufang Yang, Kaiwei Liu, Guoliang Xing, Chumin Sun, Li Zhou, Jie Sun, Zhenyu Yan

TL;DR

Synera tackles the challenge of running large language models on mobile and edge devices by embracing a device-cloud synergy that offloads only quality-critical tokens from an on-device small language model (SLM) to a cloud LLM for verification. The core approach combines confidence- and importance-based token offloading with progressive early exits, stall-free parallel inference, and a verification-aware scheduler to maintain high generation quality while minimizing communication and cloud costs. Key contributions include the token-level synergy design, a practical offloading policy with tunable budgets, and a cloud-scheduling framework that supports continuous batching amid intermittent requests. Extensive evaluations on real-world mobile and edge testbeds show substantial gains in generation quality (up to 5.47x), favorable latency, and meaningful cloud-cost reductions, demonstrating Synera’s potential for scalable, energy-efficient, high-quality LLM serving at the edge.

Abstract

Large Language Models (LLMs) are becoming key components of mobile operating systems, driving smart applications such as interactive chatbots and personal assistants. While they bring enhanced intelligence to mobile devices, their deployment suffers from a set of performance challenges, especially degraded generation quality and prolonged latency. Prior work has mainly relied on either cloud offloading or on-device Small Language Models (SLMs). However, the former is usually limited by the communication bottleneck, and the latter sacrifices generation quality due to resource constraints. To mitigate these limitations, this paper proposes Synera, a device-cloud synergistic LLM serving system that applies an efficient SLM-LLM synergistic mechanism. Through empirical studies of LLMs' unique computing characteristics, Synera identifies a set of underexplored optimization opportunities in device-cloud synergistic LLM inference, including offloading decisions, pipeline stalls, and batching bottlenecks. To translate these opportunities into enhanced performance, Synera introduces tailored designs for communication-efficient selective offloading, stall-free parallel inference, and scalable cloud batching. Extensive evaluations on real-world testbeds show that Synera achieves 1.20-5.47x better generation quality than competitive baselines with on-par latency. Compared with existing cloud serving, Synera achieves 8.2-16.5% lower cloud serving cost on various benchmarks.

Paper Structure

This paper contains 28 sections, 3 equations, 18 figures, 6 tables, 1 algorithm.

Figures (18)

  • Figure 1: Current LLM serving involves on-device SLM, cloud-centric LLM, and device-cloud synergy. Synera notably advances the accuracy-speedup frontier with significantly lower serving costs.
  • Figure 2: An illustration of LLM generation, including the attention module and the confidence and importance scores.
  • Figure 3: An illustration of the "draft & verify" process in speculative decoding.
  • Figure 4: The accuracy of the SLM’s predictions with respect to the LLM (left). The higher the probability an SLM assigns to a token, the more likely it is to match the LLM's prediction. However, tokens with high probability ($>0.8$) account for only 16.1% (right).
  • Figure 5: The SLM ranks tokens by importance score and offloads the top n% to the LLM, achieving sharp quality gains with only a 10–20% budget (left). The importance scores exhibit a long-tail distribution, where quality-critical tokens are scarce (right).
  • ...and 13 more figures
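
The captions of Figures 4 and 5 suggest two signals for deciding which tokens to offload: the probability the SLM assigns to a token (its confidence) and a per-token importance score, with only the top n% of tokens sent to the cloud LLM. The following is a minimal Python sketch of one way such a budgeted selection could look. It is not Synera's actual policy: the function name `select_tokens_to_offload`, the way the two signals are combined (filter by confidence, then rank by importance), the 0.8 threshold reused from Figure 4, and the example scores are all assumptions made for illustration.

```python
import math
from typing import List, Sequence


def select_tokens_to_offload(
    confidence: Sequence[float],
    importance: Sequence[float],
    budget: float = 0.2,
    conf_threshold: float = 0.8,
) -> List[int]:
    """Return the sorted positions of tokens to send to the cloud LLM.

    Hypothetical policy (an assumption, not taken from the paper):
    keep tokens the SLM is confident about on device, then offload the
    most important of the remaining tokens, up to `budget` of the sequence.
    """
    if len(confidence) != len(importance):
        raise ValueError("confidence and importance must have the same length")
    if not 0.0 <= budget <= 1.0:
        raise ValueError("budget must be in [0, 1]")

    # Cap on how many tokens may be offloaded (top n% of the sequence).
    max_offload = math.floor(budget * len(confidence))

    # Tokens whose SLM probability exceeds the threshold stay on device.
    candidates = [i for i, c in enumerate(confidence) if c <= conf_threshold]

    # Rank the remaining candidates by importance, highest first.
    candidates.sort(key=lambda i: importance[i], reverse=True)

    # Return the chosen positions in sequence order.
    return sorted(candidates[:max_offload])


# Made-up scores for a 10-token draft; a 20% budget allows 2 offloads.
confidence = [0.95, 0.40, 0.55, 0.90, 0.30, 0.60, 0.85, 0.20, 0.70, 0.50]
importance = [0.90, 0.10, 0.80, 0.70, 0.05, 0.60, 0.30, 0.02, 0.40, 0.50]
print(select_tokens_to_offload(confidence, importance, budget=0.2))  # [2, 5]
```

In this example, token 0 has the highest importance but is not offloaded because the SLM is already confident about it, so the budget is spent on tokens 2 and 5 instead. Raising `budget` trades more communication and cloud work for more verified tokens.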