Whistledown: Combining User-Level Privacy with Conversational Coherence in LLMs

Chelsea McMurray, Hayder Tirmazi

TL;DR

Whistledown addresses the privacy risks of sending sensitive conversations to cloud-hosted LLMs by adding a best-effort privacy layer that runs either on the user's device or in a zero-trust enterprise gateway. It combines pseudonymization with $ε$-local differential privacy ($ε$-LDP) and a resource-aware transformation strategy, and it keeps a consistent per-session token mapping so that conversations remain coherent. The work describes two deployment modes (Whistledown-Device and Whistledown-Gateway), the core techniques (Bloom-filter-based filtering, GLiNER-based NER, on-device BERT-tiny embeddings, and gender-aware or embedding-based transformations), and an evaluation of per-sentence time-to-first-token (TTFT) overhead. Without embedding-based LDP, the reported overhead ranges from 5-47 ms for 1-2 word sentences to 1.2-1.4 seconds for 59-word sentences, and enabling embedding-based LDP changes it by less than 10%. The approach provides direct PII protection, optional demographic privacy, and long-term privacy through per-session coherence plus DP guarantees, offering a practical privacy-preserving path for deploying conversational AI in both personal and enterprise settings.

Abstract

Users increasingly rely on large language models (LLMs) for personal, emotionally charged, and socially sensitive conversations. However, prompts sent to cloud-hosted models can contain personally identifiable information (PII) that users do not want logged, retained, or leaked. We observe this to be especially acute when users discuss friends, coworkers, or adversaries, i.e., when they spill the tea. Enterprises face the same challenge when they want to use LLMs for internal communication and decision-making. In this whitepaper, we present Whistledown, a best-effort privacy layer that modifies prompts before they are sent to the LLM. Whistledown combines pseudonymization and $ε$-local differential privacy ($ε$-LDP) with transformation caching to provide best-effort privacy protection without sacrificing conversational utility. Whistledown is designed to have low compute and memory overhead, allowing it to be deployed directly on a client's device in the case of individual users. For enterprise users, Whistledown is deployed centrally within a zero-trust gateway that runs on an enterprise's trusted infrastructure. Whistledown requires no changes to the existing APIs of popular LLM providers.
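
The combination of pseudonymization and transformation caching described above can be illustrated with a minimal sketch. It is not the authors' implementation: the class name `SessionPseudonymizer`, the replacement pool, and the `restore` step (mapping stand-ins back to the original names in the model's response) are assumptions made for illustration. Entity detection is also assumed to have already happened, so the detected names are passed in directly.

```python
import re


class SessionPseudonymizer:
    """Hypothetical per-session cache of name -> stand-in mappings."""

    def __init__(self, replacement_pool):
        self._pool = list(replacement_pool)  # assumed list of stand-in names
        self._forward = {}  # lowercased plaintext name -> stand-in
        self._reverse = {}  # stand-in -> plaintext name as first seen

    def pseudonym_for(self, name):
        key = name.lower()
        if key not in self._forward:
            if len(self._forward) >= len(self._pool):
                raise ValueError("replacement pool exhausted")
            alias = self._pool[len(self._forward)]
            self._forward[key] = alias
            self._reverse[alias] = name
        # Cached lookups keep the same stand-in for the whole session.
        return self._forward[key]

    @staticmethod
    def _single_pass(text, words, replace):
        # One regex pass avoids re-replacing text that was just substituted.
        ordered = sorted(set(words), key=len, reverse=True)
        pattern = re.compile(
            r"\b(" + "|".join(re.escape(w) for w in ordered) + r")\b"
        )
        return pattern.sub(lambda m: replace(m.group(1)), text)

    def transform(self, prompt, detected_names):
        if not detected_names:
            return prompt
        return self._single_pass(prompt, detected_names, self.pseudonym_for)

    def restore(self, response):
        # Assumption: stand-ins are mapped back before showing the response.
        if not self._reverse:
            return response
        return self._single_pass(
            response, self._reverse.keys(), self._reverse.__getitem__
        )
```

Because the mapping is cached per session, a name mentioned in a later turn receives the same stand-in it was given earlier, which is what lets the model keep track of who is who without seeing the real names.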

Paper Structure

This paper contains 16 sections, 1 equation, 4 figures.

Figures (4)

  • Figure 1: The life of a prompt in a Whistledown-Device deployment. All transformations are performed on a user's device before the prompt is sent to the LLM. A consistent mapping between plaintext tokens and modified tokens is maintained to ensure conversational coherence. This ensures that the user's experience while conversing with the LLM is not disrupted with Whistledown enabled.
  • Figure 2: In a Whistledown-Gateway deployment, all transformations are performed on an enterprise's trusted infrastructure before the prompt is sent to the LLM. Dorcha Gateway maintains per-user mappings, and its users may include non-human identities such as LLM agents and service accounts. The privacy controls are configured centrally by the enterprise.
  • Figure 3: Per-sentence TTFT overhead for Whistledown transformations without embedding-based LDP, across four sentence configurations. Overhead scales approximately linearly with sentence length, ranging from 5-47 ms for 1-2 word sentences to 1.2-1.4 seconds for 59-word sentences. All configurations show similar scaling, indicating entity detection (GLiNER) dominates overhead. Error bars represent standard deviation over 5 iterations.
  • Figure 4: Per-sentence TTFT overhead comparison for name transformations with and without embedding-based LDP. Surprisingly, both approaches show nearly identical overhead across all sentence lengths (within 10% of each other), indicating that BERT-tiny inference adds negligible cost compared to GLiNER-based entity detection. This suggests embedding-based LDP is practical even for latency-sensitive applications. Error bars represent standard deviation over 5 iterations.
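
The captions above refer to embedding-based LDP, but this summary does not spell out the mechanism. As a generic illustration of what an $ε$-LDP guarantee means, the sketch below uses textbook $k$-ary randomized response over a small set of candidate outputs. This is a swapped-in, standard mechanism and not Whistledown's embedding-based method; the function names and the candidate domain are hypothetical. A mechanism $M$ satisfies $ε$-LDP if, for any two inputs $x, x'$ and any output $y$, $\Pr[M(x) = y] \le e^{ε} \Pr[M(x') = y]$.

```python
import math
import random


def keep_probability(epsilon, k):
    """Probability of reporting the true value among k candidates."""
    return math.exp(epsilon) / (math.exp(epsilon) + k - 1)


def k_randomized_response(true_value, domain, epsilon, rng=random):
    """Textbook k-ary randomized response (illustrative, not the paper's method).

    `domain` is an assumed list of distinct candidate outputs that
    contains `true_value`.
    """
    k = len(domain)
    if k < 2 or true_value not in domain:
        raise ValueError("domain must hold at least 2 values including true_value")
    if rng.random() < keep_probability(epsilon, k):
        return true_value
    # Otherwise report one of the other k - 1 values uniformly at random.
    others = [v for v in domain if v != true_value]
    return rng.choice(others)
```

Under this mechanism the true value is reported with probability $e^{ε} / (e^{ε} + k - 1)$ and each other value with probability $1 / (e^{ε} + k - 1)$, so the ratio between any two output probabilities is at most $e^{ε}$. Smaller $ε$ means more randomization and stronger deniability; larger $ε$ preserves the input more often.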