PPMI: Privacy-Preserving LLM Interaction with Socratic Chain-of-Thought Reasoning and Homomorphically Encrypted Vector Databases

Yubeen Bae, Minchan Kim, Jaejin Lee, Sangbum Kim, Jaehyung Kim, Yejin Choi, Niloofar Mireshghallah

TL;DR

This work tackles the privacy risk of sending private user data to cloud-based LLMs by proposing a four-stage, privacy-preserving framework that delegates non-private reasoning to a powerful external LLM while keeping sensitive data on a trusted local device. It introduces Socratic Chain-of-Thought Reasoning to decompose complex tasks into sub-queries, which are answered using a homomorphically encrypted vector database that supports secure, dynamic retrieval with sub-second latency. Key contributions include new inner-product optimizations for encrypted search (batching, caching, butterfly decomposition, and leading-term removal) and a practical API design enabling constant-time updates, with security guarantees based on CKKS and AES-256. Experiments on LoCoMo and MediQ show that the hybrid approach significantly improves over local baselines and approaches oracle baselines, while encrypted search maintains high accuracy and scales to 1M entries with minimal overhead, highlighting a viable path to private yet capable AI assistants. The results suggest that decomposing tasks across untrusted high-capacity LLMs and trusted lightweight local models can provide strong privacy without sacrificing performance, a step toward real-world deployment of private personal AI assistants.

Abstract

Large language models (LLMs) are increasingly used as personal agents, accessing sensitive user data such as calendars, emails, and medical records. Users currently face a trade-off: They can send private records, many of which are stored in remote databases, to powerful but untrusted LLM providers, increasing their exposure risk. Alternatively, they can run less powerful models locally on trusted devices. We bridge this gap. Our Socratic Chain-of-Thought Reasoning first sends a generic, non-private user query to a powerful, untrusted LLM, which generates a Chain-of-Thought (CoT) prompt and detailed sub-queries without accessing user data. Next, we embed these sub-queries and perform encrypted sub-second semantic search using our Homomorphically Encrypted Vector Database across one million entries of a single user's private data. This represents a realistic scale of personal documents, emails, and records accumulated over years of digital activity. Finally, we feed the CoT prompt and the decrypted records to a local language model and generate the final response. On the LoCoMo long-context QA benchmark, our hybrid framework, combining GPT-4o with a local Llama-3.2-1B model, outperforms using GPT-4o alone by up to 7.1 percentage points. This demonstrates a first step toward systems where tasks are decomposed and split between untrusted strong LLMs and weak local ones, preserving user privacy.
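The four-stage flow described above can be illustrated with a minimal sketch. It is not the authors' implementation: `Plan`, `toy_embed`, `top_k`, and `answer` are hypothetical names, the remote LLM and the local model are passed in as plain callables, and the homomorphically encrypted search is replaced by a plaintext inner-product ranking purely to show where it sits in the pipeline. In the actual system the record embeddings would live encrypted in a remote database rather than being computed in the clear.

```python
import math
from dataclasses import dataclass
from typing import Callable, List

Vector = List[float]


@dataclass
class Plan:
    """Output of Stage 1: a CoT prompt plus sub-queries (hypothetical container)."""
    cot_prompt: str
    sub_queries: List[str]


def toy_embed(text: str, dim: int = 16) -> Vector:
    """Hypothetical stand-in for a local embedding model: hashed bag of words, L2-normalised."""
    vec = [0.0] * dim
    for token in text.lower().split():
        vec[sum(ord(c) for c in token) % dim] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def top_k(query_vec: Vector, keys: List[Vector], k: int) -> List[int]:
    """Rank keys by inner product with the query (plaintext stand-in for encrypted search)."""
    scores = [sum(q * x for q, x in zip(query_vec, key)) for key in keys]
    return sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)[:k]


def answer(
    query: str,
    remote_plan: Callable[[str], Plan],
    local_generate: Callable[[str, List[str]], str],
    records: List[str],
    k: int = 1,
    embed: Callable[[str], Vector] = toy_embed,
) -> str:
    # Stage 1: only the generic query is sent to the untrusted remote LLM.
    plan = remote_plan(query)
    # Stage 2: sub-queries are embedded on the trusted local device.
    sub_vecs = [embed(q) for q in plan.sub_queries]
    # Stage 3: retrieval. Here the keys are plaintext; this is an assumption
    # made for illustration, not the encrypted search the paper describes.
    keys = [embed(r) for r in records]
    hits: List[int] = []
    for vec in sub_vecs:
        for i in top_k(vec, keys, k):
            if i not in hits:
                hits.append(i)
    # Stage 4: the local model sees the CoT prompt and the retrieved records.
    return local_generate(plan.cot_prompt, [records[i] for i in hits])
```

In this sketch the private records are only ever passed to `local_generate`, never to `remote_plan`, which mirrors the split between the untrusted strong model and the trusted weak one.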

Paper Structure

This paper contains 47 sections, 14 equations, 7 figures, 8 tables, 8 algorithms.

Figures (7)

  • Figure 1: Overview of our hybrid framework. Upon receiving a query, a remote LLM generates a Chain-of-Thought (CoT) prompt and sub-queries (Stage 1) which are embedded locally (Stage 2), and used for our encrypted vector search on a remote database (Stage 3). Retrieved records are decrypted and provided with the CoT prompt as context to a local model to generate the final response (Stage 4).
  • Figure 2: Breakdown of multi-threaded search latency (64 threads) on the Deep1B dataset as the number of database entries increases. Red and pink bars represent network communication time on fast and slow networks, respectively, while the numbers above each bar indicate the corresponding latency. Blue bars represent query caching time; light-blue bars show query-key multiplication time. Takeaway: Our encrypted search scales to 1M entries with < 1 second latency, as homomorphic operations incur relatively low overhead compared to network communication.
  • Figure 3: Prompt used for sub-query generation in both the baselines and Socratic Chain-of-Thought Reasoning.
  • Figure 4: Prompt used for response generation in the baselines.
  • Figure 5: Prompt used for chain-of-thought generation in Socratic Chain-of-Thought Reasoning.
  • ...and 2 more figures