TAROT: Task-Oriented Authorship Obfuscation Using Policy Optimization Methods

Gabriel Loiseau, Damien Sileo, Damien Riquet, Maxime Meyer, Marc Tommasi

TL;DR

TAROT addresses the privacy-utility trade-off in authorship obfuscation (AO) by reframing AO as a task-driven, open-world problem and solving it with policy optimization. Starting from a supervised fine-tuning baseline, TAROT applies Proximal Policy Optimization (PPO) or Direct Preference Optimization (DPO) to rewrite entire texts while optimizing a joint reward that favors text utility and suppresses author-specific signals. Empirical results across the IMDb, BAC, and AMT datasets show that TAROT substantially reduces authorship attribution accuracy while preserving downstream task performance, with DPO generally outperforming PPO. The work demonstrates that obfuscated texts can even serve as effective training data for downstream tasks, while highlighting limitations of current LMs and the need for careful ethical consideration in deployment.

Abstract

Authorship obfuscation aims to disguise the identity of an author within a text by altering the writing style, vocabulary, syntax, and other linguistic features associated with the text author. This alteration needs to balance privacy and utility. While strong obfuscation techniques can effectively hide the author's identity, they often degrade the quality and usefulness of the text for its intended purpose. Conversely, maintaining high utility tends to provide insufficient privacy, making it easier for an adversary to de-anonymize the author. Thus, achieving an optimal trade-off between these two conflicting objectives is crucial. In this paper, we propose TAROT: Task-Oriented Authorship Obfuscation Using Policy Optimization, a new unsupervised authorship obfuscation method whose goal is to optimize the privacy-utility trade-off by regenerating the entire text considering its downstream utility. Our approach leverages policy optimization as a fine-tuning paradigm over small language models in order to rewrite texts while concealing author identity and preserving downstream task utility. We show that our approach largely reduces the accuracy of attackers while preserving utility. We make our code and models publicly available.
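
As a rough illustration of the joint reward idea summarized above (favor utility, suppress author-specific signals), the following minimal Python sketch scores rewrite candidates and selects a preferred/rejected pair of the kind a preference-optimization step could consume. It is not the authors' implementation: the embedding inputs, the cosine-based scoring, the weights `w_utility` and `w_privacy`, and the function names are all assumptions made for illustration.

```python
import numpy as np


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


def joint_reward(orig_sem, cand_sem, orig_style, cand_style,
                 w_utility=1.0, w_privacy=1.0):
    """Hypothetical joint reward for one rewrite candidate.

    orig_sem / cand_sem: content embeddings of the original and the rewrite
    (assumed to come from some sentence encoder).
    orig_style / cand_style: authorship-style embeddings of the same texts
    (assumed to come from some authorship model).
    """
    utility = cosine(orig_sem, cand_sem)            # high when content is kept
    privacy = 1.0 - cosine(orig_style, cand_style)  # high when style has moved
    return w_utility * utility + w_privacy * privacy


def preference_pair(candidates, rewards):
    """Return (chosen, rejected): the highest- and lowest-reward candidates."""
    order = np.argsort(rewards)
    return candidates[order[-1]], candidates[order[0]]
```

In this sketch, a candidate that keeps the content embedding unchanged while moving the style embedding away from the original scores highest. Scalar rewards like this could drive a PPO-style update, and `preference_pair` shows one plausible way to turn the same scores into chosen/rejected pairs for DPO.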

Paper Structure (50 sections, 5 equations, 5 figures, 8 tables, 1 algorithm)

Figures (5)

  • Figure 1: Illustration of the two versions of TAROT: We generate obfuscation candidates and optimize the best policy using reinforcement learning and preference optimization.
  • Figure 2: Authorship adversarial training accuracy results on IMDB-10 (lower is better). Generation models are more resistant to adversarial training than text editing methods.
  • Figure 3: Utility classifier accuracy once trained on IMDB-10 obfuscated texts (higher is better). The red line indicates the classifier accuracy when trained and evaluated on original data. Overall utility always increases after training on obfuscated texts, which is key to compensating for the utility drop of generation methods.
  • Figure 4: Adversarial training accuracy results (lower is better).
  • Figure 5: Utility classifier accuracy once trained on obfuscated texts (higher is better). The red line indicates the classifier accuracy when trained and evaluated on original data.