The Empirical Impact of Data Sanitization on Language Models

Anwesan Pal, Radhika Bhargava, Kyle Hinsz, Jacques Esterhuizen, Sudipta Bhattacharya

TL;DR

The results suggest that for some tasks, such as sentiment analysis or entailment, the impact of redaction is quite low, while for tasks such as comprehension Q&A, performance on redacted queries drops by more than 25% compared to the original.

Abstract

Data sanitization in the context of language modeling involves identifying sensitive content, such as personally identifiable information (PII), and redacting it from a dataset corpus. It is a common practice in natural language processing (NLP) for maintaining privacy. Nevertheless, the impact of data sanitization on the language understanding capability of a language model remains less studied. This paper empirically analyzes the effects of data sanitization across several benchmark language-modeling tasks, including comprehension question answering (Q&A), entailment, sentiment analysis, and text classification. Our experiments cover a wide spectrum, from fine-tuning small-scale language models to prompting large language models (LLMs), on both original and sanitized datasets, and compare their performance across the tasks. Interestingly, our results suggest that for some tasks, such as sentiment analysis or entailment, the impact of redaction is quite low, typically around 1-5%, while for tasks such as comprehension Q&A, performance on redacted queries drops by more than 25% compared to the original. For tasks with a higher impact, we perform a deeper dive to inspect the presence of task-critical entities. Finally, we investigate the correlation between performance and the number of redacted entities, and suggest a strategy to repair an already redacted dataset by means of content-based subsampling. Additional details are available at https://sites.google.com/view/datasan.

Paper Structure

This paper contains 19 sections, 6 figures, 6 tables.

Figures (6)

  • Figure 1: LLM chain-of-thought for a query in its original vs redacted form. In the redacted case, the reasoning module correctly highlights the missing information in the query, and is therefore unable to answer an otherwise straightforward question.
  • Figure 2: Mistral's hallucination in the context of entity redaction. As shown in the figure, Claude and GPT models correctly highlight the lack of information in the query due to redaction, and decline to provide an answer. In contrast, Mistral assigns sequential values to the various <NAME> tags, and reasons about them to arrive at the correct answer. This explains why Mistral's performance is less affected by redaction than that of the other models.
  • Figure 3: Performance of random vs content sampling with replacement for all high-impact datasets. The trend shows that randomly redacting a portion of the dataset leads to a linear drop in performance, whereas redacting samples based on their PII content leads to a non-linear drop. This non-linearity is more prominent for the GSM8k and BBH datasets, and less so for the DROP dataset. We hypothesize that this is because PII content is more uniformly distributed in DROP, which makes the two sampling methods equivalent.
  • Figure 4: Performance of LLMs on a fraction of the dataset obtained by random vs content sampling. The trend shows that the SQuADv2.0 and GSM8k datasets can be repaired by removing samples that are heavily redacted. Interestingly, DROP does not follow this trend. We hypothesize that this is due to the uniformly diverse PII content present there, so that simply removing samples based on the count does not guarantee a performance improvement.
  • Figure 5: [Best viewed in color] The figure illustrates that the DROP dataset has a diverse number of PII entities present, but that this does not necessarily impact performance when the question asks about a specific unredacted portion of a long redacted passage.
  • ...and 1 more figure
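
The difference between random and content-based subsampling described in the captions for Figures 3 and 4 can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes that redacted entities appear as angle-bracket tags (only <NAME> is mentioned in the captions; the general tag pattern is a guess), and the function names `count_redactions`, `content_subsample`, and `random_subsample` are hypothetical.

```python
import random
import re

# Assumption: every redacted entity is replaced by an upper-case tag in
# angle brackets, e.g. <NAME>. Only <NAME> appears in the source captions;
# the broader pattern is illustrative.
TAG_PATTERN = re.compile(r"<[A-Z_]+>")


def count_redactions(text: str) -> int:
    """Return the number of redaction tags found in a sample."""
    return len(TAG_PATTERN.findall(text))


def content_subsample(samples: list[str], keep_fraction: float) -> list[str]:
    """Keep the least-redacted fraction of the samples.

    Samples are ranked by redaction count (fewest first); ties keep their
    original order because Python's sort is stable.
    """
    n_keep = max(1, round(len(samples) * keep_fraction))
    ranked = sorted(samples, key=count_redactions)
    return ranked[:n_keep]


def random_subsample(samples: list[str], keep_fraction: float, seed: int = 0) -> list[str]:
    """Baseline: keep a uniformly random fraction of the samples."""
    n_keep = max(1, round(len(samples) * keep_fraction))
    rng = random.Random(seed)
    return rng.sample(samples, n_keep)
```

Under these assumptions, `content_subsample` discards the most heavily redacted samples first, which is the kind of repair the Figure 4 caption reports as effective for SQuADv2.0 and GSM8k. When redaction counts are spread uniformly across samples, as the captions hypothesize for DROP, ranking by count offers little advantage over `random_subsample`.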