Protecting Privacy in Classifiers by Token Manipulation

Re'em Harel; Yair Elboher; Yuval Pinter

TL;DR

This work tackles privacy concerns when sending text to remote LLM-based classifiers by proposing token-level privacy mechanisms. It first analyzes simple lossy token mappers (e.g., purely random, high-frequency, and low-frequency mappings) and demonstrates that naive approaches can be reversed with feasible effort, while maintaining limited downstream impact in some cases. It then introduces Stencil, a context-aware technique that aggregates information from neighboring tokens with a Gaussian window (parameters $n$ and $\sigma$) to produce a privacy-preserving token that preserves task performance better than noise-based baselines. Across SST2, IMDb, and QNLI, Stencil achieves notable privacy gains with modest accuracy losses, and its resilience to nearest-neighbor reconstruction attacks surpasses that of naive mappers, offering a practical path toward private text classification when model parameters remain inaccessible.
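
The Gaussian-window aggregation described above can be illustrated with a minimal sketch. This is not the authors' implementation: the summary only states that neighboring tokens are aggregated with a Gaussian window parameterized by $n$ and $\sigma$. The sketch assumes that $n$ is the number of neighbors on each side, that aggregation is a weighted average of token embeddings, and that the output token is the vocabulary entry nearest to that average. The names `gaussian_window` and `stencil_sketch` are hypothetical.

```python
import numpy as np


def gaussian_window(n: int, sigma: float) -> np.ndarray:
    """Normalized Gaussian weights for offsets -n..n (assumed window layout)."""
    offsets = np.arange(-n, n + 1)
    weights = np.exp(-(offsets ** 2) / (2.0 * sigma ** 2))
    return weights / weights.sum()


def stencil_sketch(token_ids, embeddings, n=2, sigma=1.0):
    """Hypothetical Stencil-like transform.

    Each token is replaced by the vocabulary token whose embedding is
    closest to a Gaussian-weighted average of the embeddings in its window.
    The nearest-token step is an assumption, not taken from the paper.
    """
    vecs = embeddings[np.asarray(token_ids)]  # shape: (seq_len, dim)
    weights = gaussian_window(n, sigma)
    seq_len = len(token_ids)
    output = []
    for i in range(seq_len):
        lo, hi = max(0, i - n), min(seq_len, i + n + 1)
        # Select the weights matching offsets (lo - i)..(hi - 1 - i).
        w = weights[lo - i + n : hi - i + n]
        w = w / w.sum()  # renormalize where the window is clipped at an edge
        mixed = w @ vecs[lo:hi]
        distances = np.linalg.norm(embeddings - mixed, axis=1)
        output.append(int(np.argmin(distances)))
    return output
```

Under these assumptions, a larger $\sigma$ spreads weight more evenly over the window and blends each token more strongly with its context, while $n = 0$ leaves every token unchanged.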

Abstract

Using language models as a remote service entails sending private information to an untrusted provider. In addition, potential eavesdroppers can intercept the messages, thereby exposing the information. In this work, we explore the prospects of avoiding such data exposure at the level of text manipulation. We focus on text classification models, examining various token mapping and contextualized manipulation functions in order to see whether classifier accuracy may be maintained while keeping the original text unrecoverable. We find that although some token mapping functions are easy and straightforward to implement, they heavily influence performance on the downstream task, and the original text can be reconstructed by a sophisticated attacker. In comparison, the contextualized manipulation provides an improvement in performance.

Paper Structure (13 sections, 1 equation, 2 figures, 6 tables)

Introduction
Lossy Mapping
Purely random mapping
High-frequency mapping
Low-frequency mapping
Task Performance
Brute-force Attacker
Resilience Against Reconstruction Attacks
Stencil Privacy Preservation
Downstream Task Performance
Nearest-neighbor Reconstruction
Impact of Window Size and $\sigma$
Conclusion

Figures (2)

Figure 1: A schematic of the various stages where differential privacy techniques can be applied in an LLM. This work focuses on level (B).
Figure 2: Schematic overview of the proposed heuristic oracle attack as it tries to reconstruct the sentence "what a nice day", which was remapped to "what what nice unicorn". Red boxes mark candidates whose probability (shown above the box) is low enough to be dropped in the next step; green boxes mark candidates that will be expanded in the next step.
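
The expand-and-prune search in the Figure 2 caption can be sketched as follows. This is one plausible reading of the caption, not the paper's algorithm: it assumes the attacker knows which original tokens map to each observed token (`preimages`), has access to some scoring oracle for partial sentences (`score`), and drops candidates below a fixed probability `threshold`. All names and the toy bigram probabilities are hypothetical.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Candidate = Tuple[Tuple[str, ...], float]


def reconstruct(
    mapped: Sequence[str],
    preimages: Dict[str, List[str]],
    score: Callable[[Tuple[str, ...]], float],
    threshold: float,
) -> List[Candidate]:
    """Expand candidate prefixes token by token, dropping unlikely ones."""
    candidates: List[Candidate] = [((), 1.0)]
    for token in mapped:
        expanded: List[Candidate] = []
        for prefix, _ in candidates:
            for original in preimages[token]:
                candidate = prefix + (original,)
                probability = score(candidate)
                if probability >= threshold:  # "green box": keep and expand
                    expanded.append((candidate, probability))
                # otherwise "red box": the candidate is dropped
        candidates = expanded
    return sorted(candidates, key=lambda c: c[1], reverse=True)


# Toy oracle: product of made-up bigram probabilities (illustration only).
BIGRAMS = {
    ("<s>", "what"): 0.5,
    ("what", "a"): 0.4,
    ("a", "nice"): 0.3,
    ("nice", "day"): 0.5,
}


def toy_score(words: Tuple[str, ...]) -> float:
    probability = 1.0
    previous = "<s>"
    for word in words:
        probability *= BIGRAMS.get((previous, word), 0.001)
        previous = word
    return probability


# Assumed lossy mapping: "a" -> "what" and "day" -> "unicorn".
PREIMAGES = {
    "what": ["what", "a"],
    "nice": ["nice"],
    "unicorn": ["day", "unicorn"],
}

result = reconstruct(
    ["what", "what", "nice", "unicorn"], PREIMAGES, toy_score, threshold=0.01
)
```

With these toy numbers the only surviving candidate is ("what", "a", "nice", "day"). A real attacker would replace `toy_score` with a language model, and the pruning rule may differ from the fixed threshold assumed here.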
