Silent Leaks: Implicit Knowledge Extraction Attack on RAG Systems through Benign Queries

Yuhao Wang, Wenjie Qu, Shengfang Zhai, Yanze Jiang, Zichen Liu, Yue Liu, Yinpeng Dong, Jiaheng Zhang

TL;DR

This paper tackles copyright and privacy risks in Retrieval-Augmented Generation by showing that valuable internal knowledge can be extracted through benign queries. It introduces IKEA, a stealthy attack that leverages anchor concepts and two mechanisms, Experience Reflection Sampling and Trust Region Directed Mutation, to progressively reveal RAG knowledge while evading detection. Extensive experiments across healthcare, literature, and gaming domains demonstrate high extraction efficiency and attack success even under defenses, and show that a substitute RAG built from IKEA-extracted knowledge can achieve performance close to the original. The work highlights a subtle attack surface in RAG systems and underscores the need for robust safeguards and auditing tools to prevent unauthorized data leakage.

Abstract

Retrieval-Augmented Generation (RAG) systems enhance large language models (LLMs) by incorporating external knowledge bases, but this may expose them to extraction attacks, leading to potential copyright and privacy risks. However, existing extraction methods typically rely on malicious inputs such as prompt injection or jailbreaking, making them easily detectable via input- or output-level detection. In this paper, we introduce Implicit Knowledge Extraction Attack (IKEA), which conducts Knowledge Extraction on RAG systems through benign queries. Specifically, IKEA first leverages anchor concepts (keywords related to internal knowledge) to generate queries with a natural appearance, and then designs two mechanisms that lead anchor concepts to thoroughly "explore" the RAG's knowledge: (1) Experience Reflection Sampling, which samples anchor concepts based on past query-response histories, ensuring their relevance to the topic; (2) Trust Region Directed Mutation, which iteratively mutates anchor concepts under similarity constraints to further exploit the embedding space. Extensive experiments demonstrate IKEA's effectiveness under various defenses, surpassing baselines by over 80% in extraction efficiency and 90% in attack success rate. Moreover, the substitute RAG system built from IKEA's extractions shows comparable performance to the original RAG and outperforms those based on baselines across multiple evaluation tasks, underscoring the stealthy copyright infringement risk in RAG systems.
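
The idea of mutating an anchor concept "under similarity constraints" can be pictured with a small sketch. This is not the authors' implementation: the function name `mutate_anchor`, the toy 2-D embeddings, the threshold `gamma`, and in particular the objective (choose the candidate least similar to the last response) are all assumptions made for illustration. Only the use of a similarity constraint comes from the abstract.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def mutate_anchor(query_vec, response_vec, candidates, gamma):
    """Pick a new anchor inside a trust region around the last query.

    candidates: dict mapping a candidate concept to its embedding.
    Trust region (assumed): cosine(candidate, query) >= gamma.
    Objective (assumed): minimize similarity to the last response,
    so the next query drifts toward content not yet seen.
    Returns None when no candidate satisfies the constraint.
    """
    in_region = {
        word: vec for word, vec in candidates.items()
        if cosine(vec, query_vec) >= gamma
    }
    if not in_region:
        return None
    return min(in_region, key=lambda w: cosine(in_region[w], response_vec))

# Toy 2-D embeddings, invented for this example.
query = [1.0, 0.0]
response = [0.8, 0.6]
candidates = {
    "near_response": [0.8, 0.6],       # in region, identical to the response
    "far_from_response": [0.8, -0.6],  # in region, dissimilar to the response
    "outside": [0.0, -1.0],            # violates the similarity constraint
}
new_anchor = mutate_anchor(query, response, candidates, gamma=0.7)
print(new_anchor)
```

The constraint keeps the mutated concept on topic, while the assumed objective pushes it away from what the system has already returned; a real attack would draw candidates from a language model or vocabulary rather than a hand-written dictionary.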

Paper Structure

This paper contains 41 sections, 1 theorem, 27 equations, 7 figures, 18 tables.

Key Result

Theorem 1

Let $q,y\in\mathbb{R}^d\setminus\{0\}$ and define the unit vectors $\hat{q}:=q/\|q\|$, $\hat{y}:=y/\|y\|$. With $\gamma\in(0,1)$ and $\langle \hat{q},\hat{y}\rangle>0$, consider the problem (P) [display equation missing from the source]. Then any minimizer $w^\star$ of (P) satisfies [display equation missing from the source], i.e., the optimum lies on the boundary of the trust region.
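
The boundary claim can be checked numerically under an assumed form of the problem. Since the statement of (P) did not survive here, the sketch below assumes it is: minimize $\langle w,\hat{y}\rangle$ over unit vectors $w$ subject to $\langle w,\hat{q}\rangle\ge\gamma$. That form, the helper names, and the example vectors are hypothetical; the paper's actual formulation may differ.

```python
import math
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    n = math.sqrt(dot(v, v))
    return [a / n for a in v]

def boundary_minimizer(q, y, gamma):
    """Closed-form minimizer of <w, y_hat> over unit w with <w, q_hat> >= gamma.

    This is for the ASSUMED problem form described in the lead-in and
    requires y not to be parallel to q.
    """
    q_hat, y_hat = normalize(q), normalize(y)
    c = dot(q_hat, y_hat)
    # Unit vector along the component of y_hat orthogonal to q_hat.
    u = normalize([yi - c * qi for yi, qi in zip(y_hat, q_hat)])
    s = math.sqrt(1.0 - gamma * gamma)
    # Tilt away from y_hat as far as the constraint allows.
    return [gamma * qi - s * ui for qi, ui in zip(q_hat, u)]

random.seed(0)
q, y, gamma = [1.0, 2.0, 0.5], [2.0, 1.0, -1.0], 0.6  # <q, y> = 3.5 > 0
q_hat, y_hat = normalize(q), normalize(y)
w_star = boundary_minimizer(q, y, gamma)

# Sample random feasible unit vectors; none should beat w_star.
best_sampled = float("inf")
for _ in range(20000):
    w = normalize([random.gauss(0.0, 1.0) for _ in range(3)])
    if dot(w, q_hat) >= gamma:
        best_sampled = min(best_sampled, dot(w, y_hat))

print("constraint value at w_star:", dot(w_star, q_hat))
print("objective at w_star:", dot(w_star, y_hat))
print("best sampled objective:", best_sampled)
```

Under this assumed form, the unconstrained minimizer $-\hat{y}$ has $\langle -\hat{y},\hat{q}\rangle<0<\gamma$ and is therefore infeasible, which is why the constraint is active at the optimum and $\langle w^\star,\hat{q}\rangle=\gamma$.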

Figures (7)

  • Figure 1: Illustration comparing Verbatim Extraction using malicious queries (such as prompt-injection [qi2025spill, zeng2024good, jiang2024rag] and jailbreak [cohen2024unleashing] methods) with Knowledge Extraction using benign queries (our method).
  • Figure 2: The IKEA pipeline. Attackers ❶ initialize the anchor database with topic keywords (\ref{sec:init}), ❷ sample anchor concepts from the database based on query history via Experience Reflection (\ref{sec: ER_sample}), ❸ generate implicit queries based on the anchor concepts (\ref{sec:init}) and query the RAG system, ❹ update the query-response history, ❺ judge whether to end mutation (\ref{Sec:TRDM}), and ❻ use TRDM (\ref{Sec:TRDM}) to generate new anchor concepts if mutation does not stop; otherwise, they start another round of sampling.
  • Figure 3: Illustration of the Trust Region Directed Mutation (TRDM) algorithm. We mutate anchor concepts under similarity constraints to exploit the embedding space, progressively covering the entire target dataset.
  • Figure 4: Results of MCQ and QA with three different knowledge bases. Extracted denotes chunks extracted with IKEA, Origin denotes the original chunks of the evaluation datasets, and Empty denotes that no reference context is provided for answering questions.
  • Figure 5: t-SNE projection of RAG databases and topics.
  • ...and 2 more figures

Theorems & Definitions (2)

  • Theorem 1: Boundary optimality under a cosine trust region
  • proof