Words Blending Boxes. Obfuscating Queries in Information Retrieval using Differential Privacy

Francesco Luigi De Faveri; Guglielmo Faggioli; Nicola Ferro

TL;DR

The paper tackles user privacy in information retrieval by proposing Word Blending Boxes (WBB), a DP-based query obfuscation mechanism that creates safe and candidate boxes in a word embedding space to ensure obfuscated terms are neither identical to nor semantically too close to originals. WBB uses an exponential mechanism-driven sampling with a formal DP guarantee ($\varepsilon$-DP) and a rigorous preprocessing/mapping/sampling pipeline to generate safe obfuscated queries. Through extensive experiments on two TREC collections and multiple IR models, the authors demonstrate that WBB achieves substantive privacy (low lexical and semantic similarity to originals) while preserving retrieval utility, especially in semantic IR settings, and provide clear guidance on parameter choices ($k$, $n$, distance function). The work presents a practical, DP-grounded approach to query obfuscation that outperforms prior DP and non-DP methods in several privacy-utility scenarios, with robust theoretical guarantees and actionable experimental insights.
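The box idea summarized above can be illustrated with a minimal sketch. It is not the authors' algorithm. It assumes that $k$ is the number of nearest neighbours excluded as the safe box, that $n$ is the size of the candidate box that follows, that Euclidean distance is the distance function, and that the utility is a distance-based score rescaled to $[0, 1]$. The function name `wbb_sketch`, the toy vocabulary, and the toy embeddings are all hypothetical, and the sketch makes no attempt to reproduce the paper's privacy proof.

```python
import numpy as np

def wbb_sketch(word, vocab, emb, k, n, eps, rng):
    """Illustrative only: skip the k nearest neighbours (assumed safe box),
    then sample one of the next n words (assumed candidate box)
    with exponential-mechanism weights."""
    idx = vocab.index(word)
    # Euclidean distance from the query word to every vocabulary word.
    dist = np.linalg.norm(emb - emb[idx], axis=1)
    order = np.argsort(dist, kind="stable")
    order = order[order != idx]          # never return the original word
    candidates = order[k:k + n]          # neighbours ranked k+1 .. k+n
    if len(candidates) == 0:
        raise ValueError("k leaves no candidates; use a smaller k or a larger vocabulary")
    # Assumed utility: closer candidates score higher, rescaled to [0, 1].
    d = dist[candidates]
    span = d.max() - d.min()
    score = np.ones(len(d)) if span == 0 else 1.0 - (d - d.min()) / span
    # Exponential-mechanism weights exp(eps * u / 2), assuming sensitivity 1.
    logits = eps * score / 2.0
    probs = np.exp(logits - logits.max())  # subtract max for numerical stability
    probs /= probs.sum()
    return vocab[rng.choice(candidates, p=probs)]

# Toy data (hypothetical): three words near "death", three far away.
vocab = ["death", "dying", "dead", "river", "music", "bread"]
emb = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0],
                [10.0, 0.0], [11.0, 0.0], [12.0, 0.0]])
rng = np.random.default_rng(0)
print(wbb_sketch("death", vocab, emb, k=2, n=3, eps=10.0, rng=rng))
```

With `k=2` and `n=3`, the two closest words ("dying", "dead") are excluded, so the call always returns one of "river", "music", or "bread". Under the assumed utility, larger `eps` concentrates the probability on the nearest candidate.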

Abstract

Ensuring the effectiveness of search queries while protecting user privacy remains an open issue. When an Information Retrieval System (IRS) does not protect the privacy of its users, sensitive information may be disclosed through the queries sent to the system. Recent improvements, especially in NLP, have shown the potential of using Differential Privacy to obfuscate texts while maintaining satisfactory effectiveness. However, such approaches may protect the user's privacy only from a theoretical perspective while, in practice, the real user's information need can still be inferred if perturbed terms are too semantically similar to the original ones. We overcome such limitations by proposing Word Blending Boxes, a novel differentially private mechanism for query obfuscation, which protects the words in the user queries by employing safe boxes. To measure the overall effectiveness of the proposed WBB mechanism, we measure the privacy obtained by the obfuscation process, i.e., the lexical and semantic similarity between original and obfuscated queries. Moreover, we assess the effectiveness of the privatized queries in retrieving relevant documents from the IRS. Our findings indicate that WBB can be integrated effectively into existing IRSs, offering a key to the challenge of protecting user privacy from both a theoretical and a practical point of view.

Paper Structure (31 sections, 1 theorem, 9 equations, 6 figures, 3 tables, 1 algorithm)

Introduction
Background on Differential Privacy
Privacy Definitions
The Exponential Mechanism
Background on Text Obfuscation
Text Obfuscation in NLP
Query Obfuscation in IR
The query obfuscation pipeline
State-of-the-art Approaches
Privacy Measures
Methodology
Motivations of the study
Semantically related words in embedding spaces
WBB Mechanism
Preprocessing
...and 16 more sections

Key Result

Theorem 1

The mechanism $\mathcal{M}$ explained in Algorithm 1 is $\varepsilon$-Differentially Private.
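
For context, the guarantee in Theorem 1 can be read against the standard textbook definitions, written here in generic notation that may differ from the paper's. The symbols $x$, $x'$, $y$, $u$, and $\Delta u$ are assumptions introduced for this illustration. A mechanism $\mathcal{M}$ is $\varepsilon$-DP if, for every pair of neighbouring inputs $x, x'$ and every output $y$,

$$\Pr[\mathcal{M}(x) = y] \le e^{\varepsilon}\,\Pr[\mathcal{M}(x') = y].$$

The exponential mechanism, with a utility function $u$ of sensitivity $\Delta u$, samples an output $y$ with probability

$$\Pr[\mathcal{M}(x) = y] \propto \exp\!\left(\frac{\varepsilon\,u(x, y)}{2\,\Delta u}\right).$$

Smaller values of $\varepsilon$ force the output distributions for neighbouring inputs closer together, which gives stronger privacy.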

Figures (6)

Figure 1: General overview of the query obfuscation pipeline in IR. The diagram illustrates the retrieval pipeline, showing the steps on the user (safe) side and on the IRS (unsafe) side.
Figure 2: Geometric intuition of the vector space.
Figure 3: Radar plot showing the distribution of sampled words and the word "Death" considering Euclidean distance and angle. The symbols of crosses ($\times$) and squares ($\square$) are utilized to represent the linguistic relations of a given word. Specifically, the crosses indicate the hyponyms, while the squares represent the synonyms. The grey circles ($\circ$) represent other words that are neither hyponyms nor synonyms.
Figure 4: Schematic overview of the WBB obfuscation procedure. Compared to the pipeline presented in Figure 1, this is the "Obfuscation Mechanism" component.
Figure 5: Comparison matrices of the Recall for different obfuscations (angle, Euclidean distance, and product), each at the same level of $\varepsilon$-DP ($\varepsilon=10$), computed using Contriever and the DL'19 collection. With no privacy, Recall$=0.528$.
...and 1 more figures

Theorems & Definitions (2)

Theorem 1
proof
