Privacy-preserved LLM Cascade via CoT-enhanced Policy Learning

Kai Zhang; Congchao Wang; Liqian Peng; Alec Go; Xiaozhong Liu

TL;DR

This work tackles privacy in on-device LLM cascades by introducing $P^{3}Defer$, a CoT-enhanced policy learning framework that treats deferral as a trainable agent with three actions: accept the local output, defer to a server LLM, or mask private tokens via a private memory. Local LLMs are trained with Chain-of-Thought-guided instruction tuning and knowledge distillation from the server, while a private memory hides sensitive tokens to mitigate leakage. The approach is evaluated on three privacy-sensitive tasks (mathematical QA, medical summarization, and email summarization), where it shows state-of-the-art cascade performance, substantially reduced privacy leakage, and high safe call rates. The results indicate that policy learning combined with a privacy-preserving memory enables more reliable, efficient, and secure on-device LLM cascades.

Abstract

Large Language Models (LLMs) have gained significant attention in on-device applications due to their remarkable performance across real-world tasks. However, on-device LLMs often suffer from suboptimal performance due to hardware limitations. A promising solution to this challenge is cascading a weaker local (on-device) LLM with a more powerful server LLM. While existing research on LLM cascades primarily optimizes the performance-cost trade-off, real-world applications impose additional requirements, such as privacy preservation, which remain largely unaddressed. In this work, we move beyond existing confidence- and logit-based LLM cascade methods and propose $\mathbf{P^{3}Defer}$, a novel Chain-of-Thought (CoT)-enhanced \textbf{p}olicy learning framework for \textbf{p}rivacy-\textbf{p}reserved \textbf{defer}ral decision-making. Our approach improves cascade efficiency while mitigating privacy risks. Extensive experiments on three benchmark datasets demonstrate the effectiveness of $\mathbf{P^{3}Defer}$ and its superiority over existing methods.
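To make the three deferral actions named in the TL;DR concrete (accept the local output, defer to the server LLM, or mask private tokens), here is a minimal Python sketch. It is an illustration under stated assumptions, not the paper's method: the learned policy is replaced by a hand-written threshold rule, the private memory is modeled as a plain set of sensitive strings, and the masked query is assumed to be forwarded to the server. All names (`Action`, `PrivateMemory`, `choose_action`, `route`, `quality_threshold`) are hypothetical.

```python
from enum import Enum


class Action(Enum):
    ACCEPT_LOCAL = "accept_local"  # keep the on-device answer
    DEFER = "defer"                # send the query to the server unchanged
    MASK = "mask"                  # hide private tokens before sending


class PrivateMemory:
    """Hypothetical private memory: a set of known sensitive strings."""

    def __init__(self, private_tokens):
        self.private_tokens = set(private_tokens)

    def contains_private(self, text):
        return any(tok in text for tok in self.private_tokens)

    def mask(self, text, placeholder="[MASKED]"):
        # Replace longer strings first so overlapping entries mask cleanly.
        for tok in sorted(self.private_tokens, key=len, reverse=True):
            text = text.replace(tok, placeholder)
        return text


def choose_action(local_quality, query, memory, quality_threshold=0.5):
    """Hand-written stand-in for the trained deferral policy (assumption)."""
    if local_quality >= quality_threshold:
        return Action.ACCEPT_LOCAL
    if memory.contains_private(query):
        return Action.MASK
    return Action.DEFER


def route(query, local_answer, local_quality, memory, server_fn):
    """Apply the chosen action; server_fn stands in for the server LLM call."""
    action = choose_action(local_quality, query, memory)
    if action is Action.ACCEPT_LOCAL:
        return action, local_answer
    outgoing = memory.mask(query) if action is Action.MASK else query
    return action, server_fn(outgoing)
```

In this sketch, private text leaves the device only after masking, and a confident local answer never triggers a server call. In the framework described above, the decision rule would be learned rather than fixed by a threshold.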
