Privacy-Preserved Neural Graph Databases

Qi Hu, Haoran Li, Jiaxin Bai, Zihao Wang, Yangqiu Song

TL;DR

This paper addresses privacy leakage risks in privacy-sensitive neural graph databases (NGDBs) arising from complex query answering. It proposes Privacy-Preserved NGDB (P-NGDB), which uses adversarial training to obfuscate private information while preserving public query accuracy, formalizes privacy definitions, and builds a three-dataset benchmark (FB15K-N, DB15K-N, YAGO15K-N) to evaluate both CQA performance and privacy protection. The approach combines query encoding with a dual-objective learning framework $L = L_u + \beta L_p$, enabling controllable privacy protection via the parameter $\beta$. Empirical results show that P-NGDB effectively reduces privacy leakage with only a modest drop in public retrieval quality, outperforming simple noise-based baselines and offering a practical path toward safer RAG over domain-private graphs.
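The role of $\beta$ in $L = L_u + \beta L_p$ can be made concrete with a small numeric sketch. It assumes $L_u$ is the loss on public queries and $L_p$ is the privacy term, both already computed as scalars; this summary does not give their exact definitions. The function name `combined_loss`, the non-negativity check on $\beta$, and the loss values are all hypothetical, chosen only for illustration.

```python
def combined_loss(l_u: float, l_p: float, beta: float) -> float:
    """Return L = L_u + beta * L_p for scalar loss values."""
    # Assumption of this sketch: beta is a non-negative weight.
    if beta < 0:
        raise ValueError("beta is assumed non-negative in this sketch")
    return l_u + beta * l_p


# Hypothetical loss values, chosen only to show how beta reweights the two terms.
l_u, l_p = 0.5, 0.25
losses = {beta: combined_loss(l_u, l_p, beta) for beta in (0.0, 1.0, 4.0)}
```

With $\beta = 0$ the objective reduces to the public-query term alone; larger values give the privacy term more weight in the total.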

Abstract

In the era of large language models (LLMs), efficient and accurate data retrieval has become increasingly crucial for using domain-specific or private data in retrieval-augmented generation (RAG). Neural graph databases (NGDBs) have emerged as a powerful paradigm that combines the strengths of graph databases (GDBs) and neural networks to enable efficient storage, retrieval, and analysis of graph-structured data, which can be adaptively trained with LLMs. Neural embedding storage and Complex neural logical Query Answering (CQA) give NGDBs the ability to generalize. When the graph is incomplete, NGDBs can extract latent patterns and representations to fill gaps in the graph structure, revealing hidden relationships and enabling accurate query answering. Nevertheless, this capability comes with an inherent trade-off: it introduces additional privacy risks for domain-specific or private databases. Malicious attackers can infer sensitive information in the database using well-designed queries. For example, from the answer sets of where Turing Award winners born after 1940 and before 1950 lived, the living places of Turing Award winner Hinton are likely to be exposed, even though those living places may have been deleted in the training stage because of privacy concerns. In this work, we propose a privacy-preserved neural graph database (P-NGDB) framework to alleviate the risks of privacy leakage in NGDBs. We introduce adversarial training techniques in the training stage to force NGDBs to generate indistinguishable answers when queried with private information, making it harder to infer sensitive information through combinations of multiple innocuous queries.

Paper Structure (27 sections, 16 equations, 5 figures, 4 tables)

Figures (5)

  • Figure 1: Privacy risks of NGDBs facing malicious queries. To illustrate the issue, consider an example where an attacker attempts to infer private information about Hinton's living place from the NGDB. Directly querying private information is easily caught by privacy risk detection; however, attackers can use well-designed queries to retrieve the desired private information. In this example, the attacker can query "where Turing Award winners born after 1940 lived," "where ones before 1950 lived," "where LeCun's collaborator lived," and so on. The queries can also be more complex, as shown in the privacy risk queries. The returned answers marked in red may leak private information in the NGDB. The intersection of these queries still has a significant likelihood of exposing the living place of Turing Award winner Hinton.
  • Figure 2: An example query showing retrieved privacy-threatening answers. The orange node denotes a privacy risk. (A) A logical knowledge graph query that involves private information. (B) An example of a complex query in the knowledge graph. Toronto is regarded as a privacy-threatening answer because inferring it requires sensitive information.
  • Figure 3: Example of computing privacy-threatening answer sets under projection, intersection, and union. Green nodes denote non-private answers, orange nodes denote privacy-threatening answers, and green-orange nodes denote different privacy risks in subsets. Red dashed arrows denote privacy projection. The range demarcated by the red dashed lines denotes the privacy-threatening answer sets. The answers circled by red dashed lines are at risk of leaking private information.
  • Figure 4: Eight general query types. Black, blue, and orange arrows denote projection, intersection, and union operators respectively.
  • Figure 5: The evaluation results of GQE with various privacy coefficients $\beta$ on FB15K-N.
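
The attack pattern described in Figure 1 comes down to intersecting the answer sets of several individually innocuous queries. The sketch below is a toy illustration only: the three answer sets are hypothetical placeholders, not data or query results from the paper, and `intersect_answers` is a helper name introduced here. Plain Python sets stand in for the answers an NGDB would return.

```python
def intersect_answers(*answer_sets):
    """Return the entities common to every answer set (empty if none are given)."""
    if not answer_sets:
        return set()
    result = set(answer_sets[0])
    for answers in answer_sets[1:]:
        result &= set(answers)
    return result


# Hypothetical answer sets for three innocuous-looking queries.
# Every place listed is illustrative, not taken from the paper's datasets.
lived_winners_born_after_1940 = {"Toronto", "Montreal", "Paris"}
lived_winners_born_before_1950 = {"Toronto", "Pittsburgh", "London"}
lived_lecun_collaborators = {"Toronto", "Montreal", "New York"}

candidates = intersect_answers(
    lived_winners_born_after_1940,
    lived_winners_born_before_1950,
    lived_lecun_collaborators,
)
```

In this toy setup each query alone returns several places, but their intersection leaves a single candidate, which is the kind of narrowing P-NGDB is meant to make harder.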