Keep It Private: Unsupervised Privatization of Online Text
Calvin Bao, Marine Carpuat
TL;DR
The paper addresses the privacy risk that authorship attribution poses for online text, using unsupervised privatization. It introduces Keep It Private (KiP), a reinforcement-learning framework that fine-tunes large language models to produce meaning-preserving rewrites that obscure author identity. Training is guided by a multi-component reward that combines privacy measured with neural author embeddings (LUAR), semantic preservation (SBERT), and grammatical soundness (CoLA). Evaluated on a large Reddit corpus and an out-of-domain BLOG dataset, KiP models evade authorship attribution and verification substantially better than baselines, with paraphrase-focused variants offering the best balance between privacy and content fidelity. The results highlight both the potential and the challenges of deployable privacy-preserving text rewriting and call for robust, multi-adversary benchmarks to assess real-world risk.
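To make the idea of a multi-component reward concrete, here is a minimal sketch of how privacy, sense, and soundness scores might be combined into one scalar. It is illustrative only: the summary does not state how KiP weights or combines its components, so the weighted average, the function names (`cosine`, `combined_reward`), and the toy vectors are all assumptions. In practice the author embeddings would come from a model such as LUAR, the sentence embeddings from SBERT, and the acceptability score from a classifier trained on CoLA; here they are passed in as plain numbers.

```python
import math
from typing import Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def combined_reward(
    author_orig: Sequence[float],
    author_rewrite: Sequence[float],
    sent_orig: Sequence[float],
    sent_rewrite: Sequence[float],
    acceptability: float,
    weights: Tuple[float, float, float] = (1.0, 1.0, 1.0),
) -> float:
    """Hypothetical reward: weighted average of privacy, sense, and soundness.

    The weighted average is an assumption for illustration, not the
    combination rule used in the paper.
    """
    # Privacy: higher when the rewrite's author embedding moves away
    # from the original author's embedding.
    privacy = 1.0 - cosine(author_orig, author_rewrite)
    # Sense: higher when the rewrite keeps the original meaning.
    sense = cosine(sent_orig, sent_rewrite)
    # Soundness: probability in [0, 1] that the rewrite is grammatical.
    soundness = acceptability
    w_privacy, w_sense, w_soundness = weights
    total = w_privacy + w_sense + w_soundness
    return (w_privacy * privacy + w_sense * sense + w_soundness * soundness) / total


# Toy inputs: the rewrite keeps the meaning but shifts the style embedding.
reward = combined_reward(
    author_orig=[1.0, 0.0],
    author_rewrite=[0.0, 1.0],
    sent_orig=[1.0, 2.0],
    sent_rewrite=[1.0, 2.0],
    acceptability=1.0,
)
```

A rewrite that copies the input verbatim would score zero on the privacy term while scoring highly on sense, which is why a reward of this kind has to trade the three components off against each other rather than maximize any one of them.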
Abstract
Authorship obfuscation techniques hold the promise of helping people protect their privacy in online communications by automatically rewriting text to hide the identity of the original author. However, obfuscation has been evaluated in narrow settings in the NLP literature and has primarily been addressed with superficial edit operations that can lead to unnatural outputs. In this work, we introduce an automatic text privatization framework that fine-tunes a large language model via reinforcement learning to produce rewrites that balance soundness, sense, and privacy. We evaluate it extensively on a large-scale test set of short- to medium-length English Reddit posts by 68k authors. We study how performance changes across evaluation conditions, including authorial profile length and authorship detection strategy. Our method maintains high text quality according to both automated metrics and human evaluation, and successfully evades several automated authorship attacks.
