
Privacy Guard & Token Parsimony by Prompt and Context Handling and LLM Routing

Alessio Langiu

Abstract

The large-scale adoption of Large Language Models (LLMs) forces a trade-off between operational cost (OpEx) and data privacy. Current routing frameworks reduce costs but ignore prompt sensitivity, exposing users and institutions to leakage risks to third-party cloud providers. We formalise the "Inseparability Paradigm": advanced context management intrinsically coincides with privacy management. We propose a local "Privacy Guard" -- a holistic contextual observer powered by an on-premise Small Language Model (SLM) -- that performs abstractive summarisation and Automatic Prompt Optimisation (APO) to decompose prompts into focused sub-tasks, re-routing high-risk queries to Zero-Trust or NDA-covered models. This dual mechanism simultaneously eliminates sensitive inference vectors (Zero Leakage) and reduces cloud token payloads (OpEx Reduction). A LIFO-based context compacting mechanism further bounds working memory, limiting the emergent leakage surface. We validate the framework through a 2x2 benchmark (Lazy vs. Expert users; Personal vs. Institutional secrets) on a 1,000-sample dataset, achieving a 45% blended OpEx reduction, 100% redaction success on personal secrets, and -- via LLM-as-a-Judge evaluation -- an 85% preference rate for APO-compressed responses over raw baselines. Our results demonstrate that Token Parsimony and Zero Leakage are mathematically dual projections of the same contextual compression operator.
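The control flow described above -- a local classifier assigning a risk level, high-risk prompts staying on Zero-Trust or NDA-covered tiers, and a LIFO compaction step bounding working memory -- can be sketched minimally as follows. All names, tier labels, the keyword-based risk heuristic, and the compaction bound are illustrative assumptions, not the paper's implementation; in the actual framework the risk classifier is an on-premise SLM.

```python
# Hypothetical sketch of the Privacy Guard routing loop. The risk
# classifier here is a keyword stand-in for the on-premise SLM, and
# the tier names are illustrative, not the paper's.

TIERS = {"low": "cloud", "medium": "nda-covered", "high": "zero-trust-local"}

def risk_level(prompt: str) -> str:
    """Stand-in for the on-premise SLM risk classifier."""
    text = prompt.lower()
    if any(k in text for k in ("password", "patient", "contract")):
        return "high"       # personal/institutional secret detected
    if "internal" in text:
        return "medium"     # institutional context, NDA-covered tier
    return "low"            # sanitised intent, safe for the cloud tier

class PrivacyGuard:
    def __init__(self, max_items: int = 4):
        self.max_items = max_items
        self.context = []   # stack: most recent entry on top

    def route(self, prompt: str) -> str:
        self.context.append(prompt)
        self._compact()
        return TIERS[risk_level(prompt)]

    def _compact(self) -> None:
        # LIFO compaction sketch: fold the two most recent entries into
        # one abstract summary until working memory fits the bound,
        # limiting the emergent leakage surface.
        while len(self.context) > self.max_items:
            newest = self.context.pop()
            older = self.context.pop()
            self.context.append(f"summary({older} | {newest})")
```

Under this sketch, a prompt mentioning a password would be routed to the local Zero-Trust tier, while a sanitised, low-risk intent proceeds to the cloud tier with a bounded context payload.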

Paper Structure

This paper contains 34 sections, 1 equation, 8 figures.

Figures (8)

  • Figure 1: The Dual-Vault Architecture. A local SLM acts as an Orchestrator and Privacy Guard, decoupling intents from secrets and routing the sanitised sub-tasks to the appropriate Cloud Tier based on trust and risk levels.
  • Figure 2: Token Parsimony vs Data Leakage by user profile and secret typology. The model achieves high Token Parsimony across the board but suffers severe leakage only in the Lazy/Institutional quadrant.
  • Figure 3: Exploratory space analysis of Leakage Rate across parameter scales (8B to 104B). Results are based on limited sample sizes designed to identify structural vulnerabilities rather than provide exhaustive statistical baselines.
  • Figure 4: Projected vs actual OpEx reduction when comparing a standard monolithic Cloud API call against active local prompt decomposition and selective routing.
  • Figure 5: LLM-as-a-Judge comparison of final response quality between raw Context (Baseline) and compressed/sanitised Context (Dual-Vault APO).
  • ...and 3 more figures