Collaborative Device-Cloud LLM Inference through Reinforcement Learning

Wenzhi Fang, Dong-Jun Han, Liangqi Yuan, Christopher Brinton

TL;DR

This work tackles the efficiency–accuracy trade-off in device–cloud LLM inference by enabling on-device models to autonomously decide cloud offloading through reinforcement learning. It introduces a unified post-training framework with a collaboration-aware hierarchical reward and a Group-Adaptive Policy Gradient (GAPG) algorithm that uses a group-level gradient estimator and adaptive prompt filtering to jointly optimize local reasoning and cloud invocation under a cloud-usage budget. Empirical results on symbolic and mathematical reasoning tasks demonstrate that the approach consistently outperforms baselines and substantially narrows the gap to full cloud LLM performance, while maintaining stable training and favorable call budgets. The method advances practical device–cloud collaboration by eliminating external routers and enabling end-to-end optimization of both reasoning and routing in a resource-constrained setting.
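One way to picture a collaboration-aware hierarchical reward is as a tiered score: a correct answer produced locally earns the most, a correct answer obtained by calling the cloud earns a smaller "coordination" reward, and an incorrect answer earns nothing. The sketch below is illustrative only. The function name, the three-tier ordering, and the numeric values are assumptions and are not taken from the paper.

```python
def hierarchical_reward(
    called_cloud: bool,
    final_correct: bool,
    r_local: float = 1.0,   # assumed value: correct answer solved on-device
    r_coord: float = 0.5,   # assumed value: correct answer obtained via the cloud
    r_fail: float = 0.0,    # assumed value: incorrect final answer
) -> float:
    """Hypothetical tiered reward; tiers and values are illustrative assumptions."""
    if not final_correct:
        return r_fail
    return r_coord if called_cloud else r_local


# Rewards for a group of four sampled responses to the same prompt:
# (called_cloud, final_correct)
group = [(False, True), (True, True), (False, False), (True, True)]
rewards = [hierarchical_reward(c, ok) for c, ok in group]
mean_reward = sum(rewards) / len(rewards)
```

Under these assumed values, a policy that always calls the cloud can never exceed the coordination tier, which is one reason the cloud-usage budget has to be handled explicitly rather than left to the reward alone.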

Abstract

Device-cloud collaboration has emerged as a promising paradigm for deploying large language models (LLMs), combining the efficiency of lightweight on-device inference with the superior performance of powerful cloud LLMs. An essential problem in this scenario lies in deciding whether a given query is best handled locally or delegated to the cloud. Existing approaches typically rely on external routers, implemented as binary classifiers, which often struggle to determine task difficulty from the prompt's surface pattern. To address these limitations, we propose a framework where the on-device LLM makes routing decisions at the end of its solving process, with this capability instilled through post-training. In particular, we formulate a reward maximization problem with carefully designed rewards that encourage effective problem solving and judicious offloading to the cloud. To solve this problem, we develop a group-adaptive policy gradient algorithm, featuring a group-level policy gradient, designed to yield an unbiased gradient estimator of the reward, and adaptive prompt filtering, developed to enforce the constraint on cloud LLM usage. Extensive experiments across models and benchmarks show that the proposed methodology consistently outperforms existing baselines and significantly narrows the gap to full cloud LLM performance.
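The routing behavior described above, where the on-device LLM decides at the end of its own solving process whether to hand off, can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `<call_cloud>` marker, the `RoutedAnswer` type, and the choice to forward only the original prompt to the cloud model are all assumptions, and both models are represented as plain callables.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical marker; the source does not specify how the call is signaled.
CALL_TOKEN = "<call_cloud>"


@dataclass
class RoutedAnswer:
    text: str
    used_cloud: bool


def collaborative_infer(
    prompt: str,
    device_llm: Callable[[str], str],
    cloud_llm: Callable[[str], str],
    call_token: str = CALL_TOKEN,
) -> RoutedAnswer:
    """Run the on-device model first; defer to the cloud only if it asks."""
    local = device_llm(prompt)
    # The routing decision is read from the end of the local response,
    # so no external router or classifier is involved.
    if local.rstrip().endswith(call_token):
        # Assumption: only the original prompt is forwarded to the cloud.
        return RoutedAnswer(text=cloud_llm(prompt), used_cloud=True)
    return RoutedAnswer(text=local, used_cloud=False)


def call_for_cloud_ratio(answers: List[RoutedAnswer]) -> float:
    """Fraction of queries that were delegated to the cloud model."""
    return sum(a.used_cloud for a in answers) / len(answers)


# Usage with stub models standing in for real LLMs.
def stub_device(p: str) -> str:
    return "4" if p == "2+2" else "I cannot solve this. " + CALL_TOKEN


def stub_cloud(p: str) -> str:
    return "cloud answer for: " + p


easy = collaborative_infer("2+2", stub_device, stub_cloud)
hard = collaborative_infer("hard problem", stub_device, stub_cloud)
ratio = call_for_cloud_ratio([easy, hard])
```

The ratio computed at the end corresponds to the call-for-cloud ratio that the cloud-usage constraint is meant to bound.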

Paper Structure

This paper contains 33 sections, 2 theorems, 62 equations, 7 figures, 6 tables.

Key Result

Proposition 3.1

Given a prompt $\boldsymbol{x}$, draw a group of $G$ responses $\{\boldsymbol{y}_1,\dots,\boldsymbol{y}_G\}$, where each response $\boldsymbol{y}_i$ may be produced entirely by the on-device policy $\pi_\theta$ (i.e., $\boldsymbol{y}_i=\boldsymbol{y}_i^\theta$) or jointly with the cloud policy $\pi_c$ ...

Figures (7)

  • Figure 1: An illustration of our proposed RL-based unified training methodology and collaborative inference framework. (a) Training Framework: Two main scenarios where the lightweight on-device LLM learns to either solve problems independently or call for help. Note that the on-device LLM is trained offline before deployment on devices. (b) Collaborative Inference: The on-device LLM autonomously determines whether to process queries locally or invoke the cloud LLM. Collaborative reasoning
  • Figure 2: Training reward and testing accuracy on the Countdown task with Qwen2.5-3B-Instruct. Our method consistently outperforms baselines, achieving higher rewards and accuracy.
  • Figure 3: Testing accuracy versus training iterations on the MATH-lighteval dataset. Our method consistently outperforms baselines across three on-device models, while also exhibiting stable training behavior, demonstrating its effectiveness and robustness.
  • Figure 4: Impact of call-for-cloud ratio on accuracy. Our approach rapidly narrows the gap to the cloud LLM as the ratio increases.
  • Figure 5: Rewards and call-for-cloud ratios over training iterations. The reward converges to the coordination reward, while the on-device LLM collapses to a degenerate policy that always invokes the cloud model ($100 \%$ call-for-cloud ratio).
  • ...and 2 more figures

Theorems & Definitions (4)

  • Proposition 3.1: Group-level Policy Gradient Estimator
  • Remark 3.2: Variance property
  • Remark 3.3: Comparison between GAPG and GRPO
  • Lemma 4.1