Privacy-Aware RAG: Secure and Isolated Knowledge Retrieval

Pengcheng Zhou, Yinglun Feng, Zhongliang Yang

TL;DR

The paper tackles privacy risks in Retrieval-Augmented Generation (RAG) by introducing an encryption-first framework that protects both textual content and embeddings. It presents two schemes: Method A, AES-CBC-based encryption with per-user keys $K_i$, and Method B, a chained dynamic key derivation approach with root keys $K_{A,i}$ and hash-based integrity, all under a user-isolated access model. Security proofs connect Method A to IND-CPA confidentiality and INT-CTXT via HMAC, while Method B leverages HKDF-based forward security, chain integrity, and trapdoor secrecy, with security guarantees that scale when the parameter $\lambda \ge 128$. The framework preserves RAG performance, supports cross-domain deployment, and advocates for stricter data-protection standards in AI-driven services.
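To make the idea of chained key derivation more concrete, the following is a minimal sketch, not the paper's exact construction. It assumes HKDF with SHA-256 (as specified in RFC 5869) and derives each node key from the previous one, so that a holder of a later key cannot recover earlier keys. The names `derive_chain` and `chain_digests`, the `info` label format, and the all-zero initial digest are hypothetical choices made for illustration only.

```python
import hashlib
import hmac

HASH_LEN = hashlib.sha256().digest_size  # 32 bytes for SHA-256


def hkdf_extract(salt: bytes, ikm: bytes) -> bytes:
    """HKDF-Extract (RFC 5869): PRK = HMAC(salt, IKM)."""
    if not salt:
        salt = bytes(HASH_LEN)  # RFC 5869 default: HashLen zero bytes
    return hmac.new(salt, ikm, hashlib.sha256).digest()


def hkdf_expand(prk: bytes, info: bytes, length: int = 32) -> bytes:
    """HKDF-Expand (RFC 5869): T(n) = HMAC(PRK, T(n-1) | info | n)."""
    if length > 255 * HASH_LEN:
        raise ValueError("requested output is too long")
    okm, block, counter = b"", b"", 1
    while len(okm) < length:
        block = hmac.new(prk, block + info + bytes([counter]), hashlib.sha256).digest()
        okm += block
        counter += 1
    return okm[:length]


def hkdf(ikm: bytes, salt: bytes, info: bytes, length: int = 32) -> bytes:
    return hkdf_expand(hkdf_extract(salt, ikm), info, length)


def derive_chain(key: bytes, user_id: str, n: int, start: int = 0) -> list[bytes]:
    """Derive n node keys, each from the previous one (hypothetical layout).

    Starting from `key`, node j gets HKDF(previous_key, info="<user>|node-<j>").
    The derivation is one-way: key j yields key j+1, but not key j-1.
    """
    keys, current = [], key
    for j in range(start, start + n):
        current = hkdf(current, salt=b"", info=f"{user_id}|node-{j}".encode())
        keys.append(current)
    return keys


def chain_digests(ciphertexts: list[bytes]) -> list[bytes]:
    """Hash chain over node ciphertexts: h_j = SHA-256(h_{j-1} | c_j).

    Changing any node changes its digest and every digest after it.
    The all-zero initial value is an assumption of this sketch.
    """
    digests, prev = [], bytes(HASH_LEN)
    for ct in ciphertexts:
        prev = hashlib.sha256(prev + ct).digest()
        digests.append(prev)
    return digests
```

In this sketch, a user holding the root key can walk the whole chain, while someone who only learns the key for node 2 can derive the keys for node 3 onward but nothing earlier. Whether the paper's Method B orders its chain in exactly this direction is not stated in this summary.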

Abstract

The widespread adoption of Retrieval-Augmented Generation (RAG) systems in real-world applications has heightened concerns about the confidentiality and integrity of their proprietary knowledge bases. These knowledge bases, which play a critical role in enhancing the generative capabilities of Large Language Models (LLMs), are increasingly vulnerable to breaches that could compromise sensitive information. To address these challenges, this paper proposes an advanced encryption methodology designed to protect RAG systems from unauthorized access and data leakage. Our approach encrypts both textual content and its corresponding embeddings prior to storage, ensuring that all data remains securely encrypted. This mechanism restricts access to authorized entities with the appropriate decryption keys, thereby significantly reducing the risk of unintended data exposure. Furthermore, we demonstrate that our encryption strategy preserves the performance and functionality of RAG pipelines, ensuring compatibility across diverse domains and applications. To validate the robustness of our method, we provide comprehensive security proofs that highlight its resilience against potential threats and vulnerabilities. These proofs also reveal limitations in existing approaches, which often lack robustness or adaptability, or depend on open-source models. Our findings suggest that integrating advanced encryption techniques into the design and deployment of RAG systems can effectively enhance privacy safeguards. This research contributes to the ongoing discourse on improving security measures for AI-driven services and advocates for stricter data protection standards within RAG architectures.

Paper Structure

This paper contains 12 sections, 40 equations, 4 figures, 1 algorithm.

Figures (4)

  • Figure 1: The portion enclosed by the green dashed box indicates the output of a correctly guarded RAG system against attacks, while the portion enclosed by the red dashed box indicates the output of the vast majority of current RAGs facing such attacks.
  • Figure 2: The system framework. Arrows show the direction of data flow, and colors denote the sources of the data. The framework shows how the system uses User A's ID and key to extract User A's text vectors and text, computes similarity, and then securely inputs User A's legitimate information into the LLM for security prompts, ensuring that no information from other users is accessed.
  • Figure 3: Scheme A's knowledge base user encryption and decryption process. The green and blue dashed lines represent the execution flows of different users, the black dashed line represents the database primary key search flow, and the orange boxes indicate the AES encryption and decryption algorithm.
  • Figure 4: Overall framework of Method B. The blue and green lines identify the data source, the blue dotted line represents the data encryption process, the blue (green) dotted line represents the linking of nodes in the linked list, and the gray line represents the process by which the user retrieves and decrypts the linked-list information through a trapdoor, using their own identifier and key. Only the encryption scheme is shown in the figure; $KeyA_1$ can decrypt along the linked list.
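
The data flow described in the Figure 2 caption (look up one user's records, decrypt them with that user's key, compute similarity, return only that user's text) can be sketched roughly as follows. This is an illustrative sketch under loud assumptions, not the paper's implementation: to stay within the Python standard library it uses an HMAC-SHA256 keystream as a stand-in cipher instead of AES-CBC, combined with an encrypt-then-MAC tag. The class `IsolatedStore`, the helpers `seal` and `open_sealed`, and the record layout are all hypothetical.

```python
import hashlib
import hmac
import math
import os
import struct


def _keystream(key: bytes, nonce: bytes, n: int) -> bytes:
    # Stand-in cipher (NOT AES-CBC): HMAC-SHA256 run in counter mode.
    out, counter = b"", 0
    while len(out) < n:
        out += hmac.new(key, nonce + struct.pack(">I", counter), hashlib.sha256).digest()
        counter += 1
    return out[:n]


def seal(enc_key: bytes, mac_key: bytes, plaintext: bytes) -> bytes:
    """Encrypt, then MAC the nonce and ciphertext (encrypt-then-MAC)."""
    nonce = os.urandom(16)
    stream = _keystream(enc_key, nonce, len(plaintext))
    ct = bytes(a ^ b for a, b in zip(plaintext, stream))
    tag = hmac.new(mac_key, nonce + ct, hashlib.sha256).digest()
    return nonce + ct + tag


def open_sealed(enc_key: bytes, mac_key: bytes, blob: bytes) -> bytes:
    """Verify the tag first; decrypt only if it matches."""
    nonce, ct, tag = blob[:16], blob[16:-32], blob[-32:]
    expected = hmac.new(mac_key, nonce + ct, hashlib.sha256).digest()
    if not hmac.compare_digest(tag, expected):
        raise ValueError("integrity check failed (wrong key or tampered record)")
    stream = _keystream(enc_key, nonce, len(ct))
    return bytes(a ^ b for a, b in zip(ct, stream))


def cosine(a, b) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class IsolatedStore:
    """Hypothetical store: text and embedding are both sealed per user."""

    def __init__(self):
        self._rows = {}  # user_id -> list of (sealed_text, sealed_vector)

    def add(self, user_id, keys, text, vector):
        enc_key, mac_key = keys
        packed = struct.pack(f"<{len(vector)}d", *vector)
        row = (seal(enc_key, mac_key, text.encode()), seal(enc_key, mac_key, packed))
        self._rows.setdefault(user_id, []).append(row)

    def search(self, user_id, keys, query, top_k=1):
        enc_key, mac_key = keys
        scored = []
        # Only rows stored under this user_id are ever touched.
        for sealed_text, sealed_vec in self._rows.get(user_id, []):
            raw = open_sealed(enc_key, mac_key, sealed_vec)
            vec = struct.unpack(f"<{len(raw) // 8}d", raw)
            text = open_sealed(enc_key, mac_key, sealed_text).decode()
            scored.append((cosine(query, vec), text))
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [text for _, text in scored[:top_k]]
```

In this sketch, a search with another user's key fails at the integrity check before any plaintext is produced, which is the isolation property the caption describes. A real deployment would replace the stand-in keystream with a vetted AES implementation.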