Keep It Private: Unsupervised Privatization of Online Text

Calvin Bao; Marine Carpuat

TL;DR

The paper addresses privacy in online text through unsupervised authorship obfuscation. It introduces Keep It Private (KiP), a reinforcement-learning framework that fine-tunes large language models to produce meaning-preserving rewrites that obscure author identity. Training is guided by a multi-component reward that combines privacy measured with neural author embeddings (LUAR), semantic preservation (SBERT), and grammatical soundness (CoLA). Evaluated on a large Reddit corpus and an out-of-domain BLOG dataset, KiP models evade attribution and verification attacks substantially better than baselines, with paraphrase-focused variants offering the best balance between privacy and content fidelity. The results highlight both the potential and the challenges of deployable privacy-preserving text rewriting, and call for robust, multi-adversary benchmarks to assess real-world risk.
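
To make the idea of a multi-component reward concrete, here is a minimal sketch of how privacy, meaning, and soundness scores might be combined into one scalar. It is not the paper's reward function, which this summary does not specify. The weighted average, the `weights` parameter, and the use of plain vectors in place of LUAR and SBERT embeddings or a CoLA classifier score are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def combined_reward(orig_style, rewrite_style,
                    orig_meaning, rewrite_meaning,
                    acceptability, weights=(1.0, 1.0, 1.0)):
    """Hypothetical reward: weighted average of three components.

    orig_style / rewrite_style: stand-ins for author-style embeddings (e.g. LUAR).
    orig_meaning / rewrite_meaning: stand-ins for sentence embeddings (e.g. SBERT).
    acceptability: stand-in for a grammaticality score in [0, 1] (e.g. a CoLA classifier).
    weights: assumed component weights, not taken from the paper.
    """
    # Privacy is high when the rewrite's style moves away from the original's.
    privacy = 1.0 - cosine(orig_style, rewrite_style)
    # Meaning is high when the rewrite's content stays close to the original's.
    meaning = cosine(orig_meaning, rewrite_meaning)
    w_p, w_m, w_s = weights
    return (w_p * privacy + w_m * meaning + w_s * acceptability) / (w_p + w_m + w_s)


# A rewrite that changes style but keeps meaning, versus copying the input.
rewrite_score = combined_reward([1.0, 0.0], [0.0, 1.0], [1.0, 2.0], [1.0, 2.0], 1.0)
copy_score = combined_reward([1.0, 0.0], [1.0, 0.0], [1.0, 2.0], [1.0, 2.0], 1.0)
```

Under this sketch, copying the input earns no privacy credit, so it scores lower than a rewrite that shifts style while preserving meaning and fluency.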

Abstract

Authorship obfuscation techniques hold the promise of helping people protect their privacy in online communications by automatically rewriting text to hide the identity of the original author. However, obfuscation has been evaluated in narrow settings in the NLP literature and has primarily been addressed with superficial edit operations that can lead to unnatural outputs. In this work, we introduce an automatic text privatization framework that fine-tunes a large language model via reinforcement learning to produce rewrites that balance soundness, sense, and privacy. We evaluate it extensively on a large-scale test set of short- to medium-length English Reddit posts by 68k authors. We study how performance changes across evaluation conditions, including authorial profile length and authorship detection strategy. Our method maintains high text quality according to both automated metrics and human evaluation, and successfully evades several automated authorship attacks.

Paper Structure (42 sections, 5 equations, 6 figures, 5 tables)

Introduction
Background
Authorship Identification
Authorship Obfuscation
Related Tasks
Approach: The "Keep it Private" Model for Authorship Obfuscation
Base Language Models
Rewards
Privacy
Meaning Preservation
Soundness
Guardrails
Overall Reward Function
Experimental Design
Data
...and 27 more sections

Figures (6)

Figure 1: Authorship obfuscation as tested by attribution and verification attacks. A verification attack asks: Are the Original and Obfuscated texts written by the same author? An attribution attack asks: which author is the Obfuscated text written by among a set of candidate authors, represented by their author profiles?
Figure 2: Higher values on the Y-axis indicate better performance of the adversarial model, and thus, worse performance in the obfuscation. A subset of baselines (Stylo, RT MT, Copy) is compared against the KiP-Bart-Para model over the first five powers-of-2 progressions for author profile size: 1 $\rightarrow$ 16 comments.
Figure 3: Results from a crowdsourced paraphrase pair evaluation. Systems are ordered from strongest (left) to weakest (right) in automatic privacy performance. Meanwhile, KiP-DIPPER produces more grammatical paraphrases than the other models, validating KiP-DIPPER's rewriting promise for achieving privacy and meaning preservation jointly.
Figure 4: Instructions given to study participants. 27 English-fluent participants were recruited via Prolific.
Figure 5: A sample multiple-choice question given to annotators.
...and 1 more figures
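
The two attacks described in the Figure 1 caption can be sketched over author-style embeddings as follows. This is an illustrative assumption rather than the paper's evaluation code: the cosine-similarity scoring, the `threshold` value, and the use of a mean vector as the author profile are all hypothetical choices.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def verification_attack(original_vec, obfuscated_vec, threshold=0.5):
    """Return True if the attacker judges both texts to share an author.

    threshold is an assumed decision boundary on cosine similarity.
    """
    return cosine(original_vec, obfuscated_vec) >= threshold


def attribution_attack(obfuscated_vec, profiles):
    """Return the candidate author whose profile is most similar to the text.

    profiles maps an author name to a list of embeddings, one per comment;
    the profile is summarized here by its mean vector (an assumption).
    """
    centroids = {author: mean_vector(vecs) for author, vecs in profiles.items()}
    return max(centroids, key=lambda author: cosine(obfuscated_vec, centroids[author]))


# Toy data: "alice" wrote the original; the rewrite lands closer to "bob".
profiles = {
    "alice": [[1.0, 0.0], [0.9, 0.1]],
    "bob": [[0.0, 1.0], [0.1, 0.9]],
}
original = [1.0, 0.0]
obfuscated = [0.2, 0.8]
predicted_author = attribution_attack(obfuscated, profiles)
same_author = verification_attack(original, obfuscated)
```

In this toy setup the obfuscation succeeds against both attacks: the attribution attack picks the wrong author, and the verification attack fails to link the rewrite to the original.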
