Protecting De-identified Documents from Search-based Linkage Attacks

Pierre Lison, Mark Anderson

TL;DR

The work addresses linkage risk in de-identified text, focusing on search-based, phrase-level attacks. It introduces a two-step method: build an inverted index of N-grams to identify spans that occur in fewer than $k$ documents, then iteratively rewrite those spans, in their local context, with an instruction-tuned LLM until the linkage no longer holds, while preserving meaning. Empirical results on a court-case dataset show substantial linkage reduction (up to ~99.8% for arity ≤ 3) with strong semantic fidelity, outperforming DP-based and handcrafted baselines. The approach requires access to the original document pool and targets only phrase-based linkages; broader linkage threats and multilingual datasets are left to future work.

Abstract

While de-identification models can help conceal the identity of the individual(s) mentioned in a document, they fail to address linkage risks, defined as the potential to map the de-identified text back to its source. One straightforward way to perform such linkages is to extract phrases from the de-identified document and then check their presence in the original dataset. This paper presents a method to counter search-based linkage attacks while preserving the semantic integrity of the text. The method proceeds in two steps. We first construct an inverted index of the N-grams occurring in the document collection, making it possible to efficiently determine which N-grams appear in fewer than $k$ documents (either alone or in combination with other N-grams). An LLM-based rewriter is then iteratively queried to reformulate those spans until linkage is no longer possible. Experimental results on a collection of court cases show that the method is able to effectively prevent search-based linkages while remaining faithful to the original content.
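
The first step described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the whitespace tokenizer, the function names (`ngrams`, `build_index`, `rare_ngrams`, `rare_combinations`), and the default `max_n=3` are all assumptions made for illustration. The sketch maps each N-gram to the set of documents containing it, then reports N-grams, and combinations of individually common N-grams, that match fewer than $k$ documents.

```python
from collections import defaultdict
from itertools import combinations


def ngrams(tokens, max_n):
    """Yield every N-gram (as a tuple of tokens) of length 1..max_n."""
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            yield tuple(tokens[i:i + n])


def build_index(docs, max_n=3):
    """Inverted index: N-gram -> set of ids of the documents containing it.

    Tokenization by lowercased whitespace split is an assumption.
    """
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for gram in ngrams(text.lower().split(), max_n):
            index[gram].add(doc_id)
    return index


def rare_ngrams(index, text, k, max_n=3):
    """N-grams of `text` that occur in fewer than k documents."""
    grams = set(ngrams(text.lower().split(), max_n))
    return sorted(g for g in grams if len(index.get(g, ())) < k)


def rare_combinations(index, text, k, max_n=3, arity=2):
    """Combinations of individually common N-grams that jointly
    occur in fewer than k documents (posting-list intersection)."""
    grams = set(ngrams(text.lower().split(), max_n))
    common = sorted(g for g in grams if len(index.get(g, ())) >= k)
    rare = []
    for combo in combinations(common, arity):
        shared = set.intersection(*(index[g] for g in combo))
        if len(shared) < k:
            rare.append(combo)
    return rare


docs = [
    "the court dismissed the appeal",
    "the court heard the applicant",
    "the applicant lodged the appeal",
]
index = build_index(docs)
print(rare_ngrams(index, docs[0], k=2, max_n=1))
print(rare_combinations(index, docs[0], k=2, max_n=1))
```

In this toy collection, "court" and "appeal" each appear in two documents, yet only the first document contains both, which is why combinations need to be checked in addition to single N-grams.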

Paper Structure

This paper contains 27 sections, 1 equation, 1 figure, 1 table.

Figures (1)

  • Figure 1: To prevent search-based linkages between a de-identified document and its original source, we identify all N-grams occurring fewer than $k$ times in the collection, and use an LLM to reformulate each N-gram (in its local context) until the linkages are averted.
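
The reformulation loop in Figure 1 can be sketched as follows. This is a hedged illustration rather than the paper's procedure: the `rewriter` callable stands in for the LLM, the names `is_linkable` and `protect_span` and the `max_rounds` cap are hypothetical, matching uses plain substring search, and a phrase is assumed to be linkable when it matches at least one but fewer than $k$ documents.

```python
def is_linkable(phrase, collection, k):
    """Assumption: a phrase enables linkage when it matches at least one
    but fewer than k documents of the original collection."""
    hits = sum(phrase.lower() in doc.lower() for doc in collection)
    return 0 < hits < k


def protect_span(text, span, collection, k, rewriter, max_rounds=5):
    """Reformulate `span` inside `text` until it is no longer linkable,
    or until max_rounds attempts have been made.

    `rewriter(context, phrase)` is a placeholder for an LLM call that
    returns a reformulation of `phrase` given its surrounding context.
    """
    current = span
    for _ in range(max_rounds):
        if not is_linkable(current, collection, k):
            break
        candidate = rewriter(text, current)
        text = text.replace(current, candidate, 1)
        current = candidate
    return text


# Toy rewriter backed by a lookup table, standing in for the LLM.
synonyms = {"dismissed": "rejected"}


def toy_rewriter(context, phrase):
    return synonyms.get(phrase, phrase)


collection = [
    "the court dismissed the appeal",
    "the court upheld the appeal",
]
protected = protect_span(collection[0], "dismissed", collection, k=2,
                         rewriter=toy_rewriter)
print(protected)
```

Re-checking the candidate against the collection after every rewrite is what makes the loop iterative: a reformulation that still matches fewer than $k$ source documents is sent back to the rewriter.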