Understanding and Mitigating the Threat of Vec2Text to Dense Retrieval Systems

Shengyao Zhuang; Bevan Koopman; Xiaoran Chu; Guido Zuccon

TL;DR

Vec2Text poses a privacy risk to dense retrieval because it can reconstruct the original text from its embedding. The paper reproduces Vec2Text (with fixes) and evaluates how reconstructible text is across embedding models and design choices, including distance metric, pooling function, bottleneck pre-training, and quantization. It finds that mean pooling and bottleneck pre-training increase reconstructibility, while dimensionality reduction and product quantization can reduce it without hurting retrieval. It also proposes mitigation strategies, including noise injection and a user-specific embedding transformation that preserves retrieval effectiveness while degrading reconstructibility, giving practitioners practical ways to protect privacy in dense retrieval systems.

Abstract

The emergence of Vec2Text, a method for text embedding inversion, has raised serious privacy concerns for dense retrieval systems that use text embeddings, such as those offered by OpenAI and Cohere. This threat comes from the ability of a malicious attacker with access to embeddings to reconstruct the original text. In this paper, we investigate various factors related to embedding models that may impact text recoverability via Vec2Text. We explore factors such as distance metrics, pooling functions, bottleneck pre-training, training with noise addition, embedding quantization, and embedding dimensions, which were not considered in the original Vec2Text paper. Through a comprehensive analysis of these factors, we aim to gain a deeper understanding of the key elements that affect the trade-off between text recoverability and retrieval effectiveness in dense retrieval systems, offering insights for practitioners designing privacy-aware dense retrieval systems. We also propose a simple embedding transformation fix that guarantees equal ranking effectiveness while mitigating the recoverability risk. Overall, this study reveals that Vec2Text could pose a threat to current dense retrieval systems, but that there are effective methods to patch such systems.
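
To illustrate how an embedding transformation can leave rankings untouched, here is a minimal sketch. It assumes dot-product scoring and uses a Householder reflection keyed by a secret vector as the transformation; this is one simple orthogonal map chosen for illustration and is not necessarily the transformation proposed in the paper. The names `make_secret_key`, `transform`, and `rank` are hypothetical.

```python
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def make_secret_key(dim, seed):
    """Hypothetical per-user secret: a random direction vector derived from a seed."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]

def transform(embedding, key):
    """Apply the Householder reflection H = I - 2 v v^T / (v^T v), with v = key.

    H is orthogonal, so dot products between transformed vectors are unchanged.
    """
    scale = 2.0 * dot(key, embedding) / dot(key, key)
    return [e - scale * k for e, k in zip(embedding, key)]

def rank(query, docs):
    """Return document indices sorted by descending dot-product score."""
    scores = [dot(query, d) for d in docs]
    return sorted(range(len(docs)), key=lambda i: -scores[i])

# Toy data standing in for real query and passage embeddings.
rng = random.Random(0)
dim = 8
docs = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(5)]
query = [rng.gauss(0.0, 1.0) for _ in range(dim)]
key = make_secret_key(dim, seed=42)

original_ranking = rank(query, docs)
transformed_ranking = rank(transform(query, key), [transform(d, key) for d in docs])
print(original_ranking == transformed_ranking)
```

Because the reflection preserves inner products, the two rankings agree, while the stored vectors no longer match what the public embedding model would produce for the same text. Whether this is enough to defeat an inversion attack depends on the attacker not knowing the key.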

Paper Structure

This paper contains 13 sections, 2 equations, 2 figures, and 7 tables.

Introduction
The Vec2Text Method
Reproduction of Vec2Text
Experimental Methodology
Reproduction Results
Understanding what impacts Vec2Text effectiveness
Distance Metric and Pooling Method
Zero-shot Regime and Bottleneck Pre-training
Embedding Dimensionality & Quantization
Mitigation Strategies
Noise Injection
Embedding Transformation
Conclusion

Figures (2)

Figure 1: Overview of Vec2Text, taken from Morris et al. (2023) with permission. "Given access to a target embedding $e$ (blue) and query access to an embedding model $\phi$ (blue model), the system aims to iteratively generate (yellow model) hypotheses $\hat{e}$ (pink) to reach the target. Example input is a [passage] taken from a recent Wikipedia article (June 2023). Vec2Text perfectly recovers this text from its embedding after 4 rounds of correction."
Figure 2: Impact on retrieval effectiveness and reconstructibility of different amounts of noise injection with different retrieval models. Larger $\lambda$ signifies more noise injection.
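
As a rough illustration of the noise injection that Figure 2 varies, the sketch below adds scaled Gaussian noise to an embedding. The form $e + \lambda \epsilon$ with $\epsilon \sim \mathcal{N}(0, 1)$ per dimension is an assumption made for this example; the caption only states that larger $\lambda$ means more noise. The function names and the toy embedding are hypothetical.

```python
import math
import random

def add_noise(embedding, lam, rng):
    """Return embedding + lam * eps, with eps drawn i.i.d. from N(0, 1) (assumed form)."""
    return [e + lam * rng.gauss(0.0, 1.0) for e in embedding]

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    numerator = sum(x * y for x, y in zip(a, b))
    return numerator / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

rng = random.Random(0)
embedding = [rng.gauss(0.0, 1.0) for _ in range(16)]  # stand-in for a real embedding

# Larger lambda moves the noisy embedding further from the clean one.
for lam in (0.0, 0.1, 1.0):
    noisy = add_noise(embedding, lam, random.Random(1))
    print(f"lambda={lam}: cosine to clean embedding = {cosine(embedding, noisy):.3f}")
```

With $\lambda = 0$ the embedding is unchanged; as $\lambda$ grows, the vector drifts away from the clean embedding, which is the mechanism by which both reconstructibility and retrieval effectiveness can be affected.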
