Protecting Privacy in Classifiers by Token Manipulation
Re'em Harel, Yair Elboher, Yuval Pinter
TL;DR
This work tackles privacy concerns that arise when sending text to remote LLM-based classifiers by proposing token-level privacy mechanisms. It first analyzes simple lossy token mappers (e.g., purely random, high-frequency, and low-frequency mappings) and shows that these naive approaches can be reversed with feasible effort, even in the cases where their downstream impact is limited. It then introduces Stencil, a context-aware technique that aggregates information from neighboring tokens with a Gaussian window (parameters $n$ and $\sigma$) to produce a replacement token that retains task performance better than noise-based baselines. Across SST2, IMDb, and QNLI, Stencil achieves notable privacy gains with modest accuracy losses, and it is more resilient to nearest-neighbor reconstruction attacks than the naive mappers, offering a practical path toward private text classification when model parameters remain inaccessible.
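To make the Gaussian-window idea concrete, the following is a minimal sketch and not the authors' implementation. It assumes that tokens are represented by embedding vectors, that $n$ counts neighbors on each side of a position, that the center token is included in the window, and that windows are truncated and renormalized at sequence boundaries. The function names `gaussian_weights`, `stencil_like`, and `nearest_token` are hypothetical.

```python
import math


def gaussian_weights(n, sigma):
    """Normalized weights for offsets -n..n, proportional to exp(-d^2 / (2*sigma^2))."""
    raw = [math.exp(-(d * d) / (2.0 * sigma * sigma)) for d in range(-n, n + 1)]
    total = sum(raw)
    return [w / total for w in raw]


def stencil_like(embeddings, n, sigma):
    """Replace each vector by a Gaussian-weighted average over its window.

    Assumption: windows that run past the sequence ends are truncated and
    the remaining weights are renormalized.
    """
    weights = gaussian_weights(n, sigma)
    dim = len(embeddings[0])
    mixed = []
    for i in range(len(embeddings)):
        acc = [0.0] * dim
        mass = 0.0
        for k, d in enumerate(range(-n, n + 1)):
            j = i + d
            if 0 <= j < len(embeddings):
                w = weights[k]
                mass += w
                for c in range(dim):
                    acc[c] += w * embeddings[j][c]
        mixed.append([a / mass for a in acc])
    return mixed


def nearest_token(vector, vocab):
    """Return the token in vocab (token -> embedding) closest to vector.

    Assumption: squared Euclidean distance is used to pick the output token.
    """
    return min(
        vocab,
        key=lambda t: sum((a - b) ** 2 for a, b in zip(vector, vocab[t])),
    )
```

Under these assumptions, a larger $\sigma$ spreads weight more evenly across the window, so each output vector depends less on the token originally at that position; with $n = 0$ the sequence is returned unchanged.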
Abstract
Using language models as a remote service entails sending private information to an untrusted provider. In addition, potential eavesdroppers can intercept the messages, thereby exposing the information. In this work, we explore the prospects of avoiding such data exposure at the level of text manipulation. We focus on text classification models, examining various token mapping and contextualized manipulation functions in order to see whether classifier accuracy may be maintained while keeping the original text unrecoverable. We find that although some token mapping functions are straightforward to implement, they heavily influence performance on the downstream task, and the original text can still be reconstructed by a sophisticated attacker. In comparison, the contextualized manipulation provides an improvement in performance.
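As an illustration of how simple a token mapping function can be, here is a minimal sketch of a purely random many-to-one mapper. It is an assumption-laden example rather than the construction evaluated in this work: the helper names `build_random_mapper` and `apply_mapper`, the `num_targets` parameter, and the choice to draw target tokens from the vocabulary itself are all hypothetical.

```python
import random


def build_random_mapper(vocab, num_targets, seed=0):
    """Map every token in vocab to one of num_targets randomly chosen tokens.

    The mapping is many-to-one (lossy) whenever num_targets < len(vocab),
    so distinct input tokens can collide on the same output token.
    """
    rng = random.Random(seed)
    targets = rng.sample(vocab, num_targets)  # vocab must be a sequence
    return {token: rng.choice(targets) for token in vocab}


def apply_mapper(tokens, mapper):
    """Replace each token by its mapped counterpart before sending the text."""
    return [mapper[token] for token in tokens]


# Hypothetical toy vocabulary; a real one would come from the tokenizer.
toy_vocab = ["the", "movie", "was", "great", "terrible", "plot"]
toy_mapper = build_random_mapper(toy_vocab, num_targets=3, seed=42)
mapped = apply_mapper(["the", "movie", "was", "great"], toy_mapper)
```

Because the mapping is fixed per token, every occurrence of a given token is always replaced by the same output, and that regularity is one reason an attacker with enough observations may be able to reverse such a scheme.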
