Privacy-preserved LLM Cascade via CoT-enhanced Policy Learning

Kai Zhang, Congchao Wang, Liqian Peng, Alec Go, Xiaozhong Liu

TL;DR

This work tackles privacy in on-device LLM cascades by introducing $P^{3}Defer$, a CoT-enhanced policy learning framework that treats deferral as a trainable agent with three actions: accept the local output, defer to a server LLM, or mask private tokens via a private memory. Local LLMs are trained with Chain-of-Thought guided instruction tuning and knowledge distillation from the server, while the private memory hides sensitive tokens to mitigate leakage. The approach is evaluated on three privacy-sensitive tasks (mathematical QA, medical summarization, email summarization), showing state-of-the-art cascade performance and substantially reduced privacy leakage, with high safe call rates. The results indicate that policy learning combined with a privacy-preserving memory makes on-device LLM cascades more reliable, efficient, and secure.

Abstract

Large Language Models (LLMs) have gained significant attention in on-device applications due to their remarkable performance across real-world tasks. However, on-device LLMs often suffer from suboptimal performance due to hardware limitations. A promising solution to this challenge is cascading a weaker local (on-device) LLM with a more powerful server LLM. While existing research on LLM cascade primarily optimizes the performance-cost trade-off, real-world applications impose additional requirements, such as privacy preservation, which remain largely unaddressed. In this work, we move beyond existing confidence- and logit-based LLM cascade methods and propose $\mathbf{P^{3}Defer}$, a novel Chain-of-Thought (CoT)-enhanced policy learning framework for privacy-preserved deferral decision-making. Our approach effectively improves cascade efficiency while mitigating privacy risks. Extensive experiments on three benchmark datasets demonstrate the effectiveness and superiority of $\mathbf{P^{3}Defer}$ over existing methods.

Paper Structure

This paper contains 28 sections, 10 equations, 11 figures, 4 tables, 1 algorithm.

Figures (11)

  • Figure 1: On the left is the existing LLM cascade, where the deferral module makes decisions solely based on the quality of the local answer, potentially leading to privacy leakage. On the right is the privacy-preserved LLM cascade, where deferral decisions are more aligned with the needs of real-world applications.
  • Figure 2: Overview of the proposed $\mathbf{P^{3}Defer}$ framework. Given a user query $x$, the local model $\Phi(L)$ generates a response $y^L$. The agent $\mathcal{A}$ decides among three actions based on the state $s_t$: (1) return $y^L$, (2) defer to the server model $\Phi(S)$ for response $y^S$, or (3) mask private tokens via private memory. The agent is trained via reinforcement learning, where the reward function $\mathcal{R}$ evaluates response quality and privacy, and the critic function $\mathcal{C}$ assesses long-term decision-making.
  • Figure 3: Inference process of the proposed framework. The Deferral Agent determines whether to return the local response, defer to the server LLM, or apply privacy masking based on the input query.
  • Figure 4: Curves depicting cascade performance versus call rate for different methods across all three datasets: (a) GSM8K, (b) MedSum, and (c) EmailSum.
  • Figure 5: Ablation study on CoT usage.
  • ...and 6 more figures
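
The three-way decision described in the captions of Figures 2 and 3 can be illustrated with a minimal Python sketch. It is not the authors' implementation: the learned agent is replaced by a hand-written threshold rule, the private memory by a plain set of lowercase sensitive words, and the names `PrivateMemory`, `choose_action`, `cascade`, `scorer`, and `threshold` are all hypothetical. The sketch also assumes that a masked query is then forwarded to the server model, which the captions do not state explicitly.

```python
import re
from dataclasses import dataclass, field
from enum import Enum


class Action(Enum):
    RETURN_LOCAL = 1  # (1) return the local response y^L
    DEFER = 2         # (2) defer to the server model
    MASK = 3          # (3) mask private tokens via the private memory


@dataclass
class PrivateMemory:
    """Hypothetical stand-in for the private memory: lowercase sensitive words."""
    tokens: set = field(default_factory=set)

    def contains_private(self, text):
        # True if any word of the text is registered as sensitive.
        words = set(re.findall(r"\w+", text.lower()))
        return bool(words & self.tokens)

    def mask(self, text):
        # Replace every sensitive word with a placeholder, keep the rest.
        def repl(match):
            word = match.group(0)
            return "[MASK]" if word.lower() in self.tokens else word
        return re.sub(r"\w+", repl, text)


def choose_action(quality, has_private, threshold=0.5):
    """Toy rule standing in for the trained policy (an assumption)."""
    if quality >= threshold:
        return Action.RETURN_LOCAL
    return Action.MASK if has_private else Action.DEFER


def cascade(query, local_llm, server_llm, scorer, memory, threshold=0.5):
    """Run one query through the cascade and return (action, response)."""
    y_local = local_llm(query)
    action = choose_action(
        scorer(query, y_local), memory.contains_private(query), threshold
    )
    if action is Action.RETURN_LOCAL:
        return action, y_local
    if action is Action.MASK:
        # Assumption: the masked query is what the server receives.
        query = memory.mask(query)
    return action, server_llm(query)
```

Here `local_llm`, `server_llm`, and `scorer` are ordinary callables supplied by the caller, so the control flow can be exercised with stubs. In the framework itself the decision comes from an agent trained with reinforcement learning against a reward covering response quality and privacy, not from a fixed threshold.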