Words Blending Boxes. Obfuscating Queries in Information Retrieval using Differential Privacy
Francesco Luigi De Faveri, Guglielmo Faggioli, Nicola Ferro
TL;DR
The paper tackles user privacy in information retrieval by proposing Word Blending Boxes (WBB), a query obfuscation mechanism based on Differential Privacy (DP). WBB builds a safe box and a candidate box around each query term in a word embedding space, so that the obfuscated term is neither identical to the original nor semantically too close to it. Obfuscated terms are sampled with the exponential mechanism, which gives a formal $\varepsilon$-DP guarantee, and queries pass through a preprocessing, mapping, and sampling pipeline. In experiments on two TREC collections with multiple IR models, WBB yields obfuscated queries with low lexical and semantic similarity to the originals while preserving retrieval utility, especially with semantic IR models. The authors also give guidance on the choice of parameters ($k$, $n$, and the distance function) and report that WBB outperforms prior DP and non-DP methods in several privacy-utility scenarios.
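The box-based sampling summarized above can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `obfuscate_term`, the toy `embeddings` dictionary, the reading of $k$ as the size of the safe box and $n$ as the size of the candidate box, and the distance-based utility score are all assumptions made for illustration. The sketch shows only the shape of exponential-mechanism sampling and makes no claim about reproducing the paper's formal guarantee.

```python
import math
import random


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def obfuscate_term(term, embeddings, k, n, epsilon, rng=random):
    """Sample a replacement for `term` from a candidate box (hypothetical helper).

    Assumed interpretation: the k nearest neighbours form the safe box and are
    never returned; the next n neighbours form the candidate box.
    """
    origin = embeddings[term]
    # Rank every other vocabulary word by distance from the original term.
    ranked = sorted(
        (w for w in embeddings if w != term),
        key=lambda w: euclidean(origin, embeddings[w]),
    )
    candidates = ranked[k:k + n]
    if not candidates:
        raise ValueError("vocabulary too small for the chosen k and n")

    # Assumed utility: 1 for the closest candidate, 0 for the farthest,
    # so scores lie in [0, 1] and the sensitivity is taken to be 1.
    dists = [euclidean(origin, embeddings[w]) for w in candidates]
    d_min, d_max = min(dists), max(dists)
    spread = d_max - d_min
    utilities = [1.0 if spread == 0 else 1.0 - (d - d_min) / spread for d in dists]

    # Exponential mechanism: P(w) is proportional to exp(epsilon * u(w) / 2).
    weights = [math.exp(epsilon * u / 2.0) for u in utilities]
    return rng.choices(candidates, weights=weights, k=1)[0]


# Toy two-dimensional vocabulary, invented for this example.
toy_embeddings = {
    "cancer": (0.0, 0.0),
    "tumor": (0.1, 0.0),
    "disease": (1.0, 0.0),
    "illness": (1.2, 0.0),
    "car": (5.0, 5.0),
}
replacement = obfuscate_term("cancer", toy_embeddings, k=1, n=2, epsilon=1.0)
```

With `k=1` and `n=2`, the nearest neighbour `tumor` falls in the safe box and is excluded, so the replacement is drawn from `disease` and `illness`. A larger `epsilon` shifts probability toward the closer candidate, while a smaller one makes the draw closer to uniform.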
Abstract
Ensuring the effectiveness of search queries while protecting user privacy remains an open issue. When an Information Retrieval System (IRS) does not protect the privacy of its users, sensitive information may be disclosed through the queries sent to the system. Recent advances, especially in NLP, have shown the potential of using Differential Privacy to obfuscate texts while maintaining satisfactory effectiveness. However, such approaches may protect the user's privacy only from a theoretical perspective: in practice, the user's real information need can still be inferred if the perturbed terms are too semantically similar to the original ones. We overcome this limitation by proposing Word Blending Boxes (WBB), a novel differentially private mechanism for query obfuscation, which protects the words in user queries by employing safe boxes. To evaluate the proposed WBB mechanism, we measure the privacy obtained by the obfuscation process, i.e., the lexical and semantic similarity between original and obfuscated queries. Moreover, we assess the effectiveness of the privatized queries in retrieving relevant documents from the IRS. Our findings indicate that WBB can be integrated effectively into existing IRSs, offering a solution to the challenge of protecting user privacy from both a theoretical and a practical point of view.
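The privacy evaluation described in the abstract compares original and obfuscated queries lexically and semantically. The abstract does not name the metrics, so the sketch below assumes two common choices purely for illustration: Jaccard similarity over query tokens for the lexical side, and cosine similarity between query vectors for the semantic side. The function names and the example queries are hypothetical.

```python
import math


def jaccard_similarity(original, obfuscated):
    """Lexical overlap between two queries, computed on lowercase token sets."""
    a = set(original.lower().split())
    b = set(obfuscated.lower().split())
    if not a and not b:
        return 1.0  # two empty queries are treated as identical
    return len(a & b) / len(a | b)


def cosine_similarity(u, v):
    """Cosine of the angle between two query vectors (assumed non-zero)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


# One shared token ("lung") out of five distinct tokens overall.
lexical = jaccard_similarity("lung cancer treatment", "lung illness therapy")
```

Under these assumed metrics, lower values mean that the obfuscated query reveals less about the original one, which is the sense in which the abstract speaks of privacy.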
