PART: Pre-trained Authorship Representation Transformer
Javier Huertas-Tato, Alejandro Martin, David Camacho
TL;DR
PART is a pretrained authorship representation transformer: a contrastively trained RoBERTa-based encoder designed to learn stylometric author embeddings that capture writing style rather than content. Using a dual-pool contrastive objective with NT-Xent loss and a BiLSTM on top of frozen semantic representations, PART achieves strong zero-shot attribution and competitive PAN@CLEF results across the Gutenberg, Blog, and Enron datasets, including a zero-shot accuracy of $72.39\%$ at $N=250$ and a top-5 accuracy of $86.63\%$. Three case studies (Enron, Gutenberg, Blogs) show that the embeddings organize texts by author centroids, genre, and demographic signals, while also revealing biases and potential topic leakage. The proposed embeddings enable cross-domain authorship analysis with practical implications for forensics, mental health, and misinformation research, and the authors share code for replication.
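To make the contrastive objective concrete, the following is a minimal sketch of the standard NT-Xent loss over a batch of positive pairs, written in plain Python. It is an illustration under assumptions, not the authors' implementation: the names `cosine` and `nt_xent`, the temperature value, and the pairing convention (`view_a[i]` and `view_b[i]` are embeddings of two texts by the same author) are all hypothetical, and the dual-pool setup and BiLSTM head are not modeled here.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nt_xent(view_a, view_b, temperature=0.5):
    """Standard NT-Xent loss over N positive pairs.

    Assumption: view_a[i] and view_b[i] are embeddings of two different
    texts written by the same author; every other embedding in the batch
    is treated as a negative. The temperature is an illustrative value.
    """
    n = len(view_a)
    z = view_a + view_b  # 2N embeddings in one list
    total = 0.0
    for i in range(2 * n):
        pos = (i + n) % (2 * n)  # index of the positive partner of i
        # Similarities to every other embedding (self-similarity excluded).
        logits = [cosine(z[i], z[k]) / temperature
                  for k in range(2 * n) if k != i]
        pos_logit = cosine(z[i], z[pos]) / temperature
        # Numerically stable log-sum-exp for the denominator.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denom - pos_logit
    return total / (2 * n)


# Toy batch: two "authors" with clearly separated directions.
view_a = [[1.0, 0.0], [0.0, 1.0]]
view_b = [[1.0, 0.1], [0.1, 1.0]]
aligned_loss = nt_xent(view_a, view_b)
mismatched_loss = nt_xent(view_a, view_b[::-1])
```

In this toy batch the loss is lower when the pairs are correctly aligned than when the second view is reversed, which is the behavior a contrastive authorship objective relies on: texts by the same author are pulled together and texts by different authors are pushed apart.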
Abstract
Authors writing documents imprint identifying information within their texts: vocabulary, register, punctuation, misspellings, or even emoji usage. Previous works use hand-crafted features or classification tasks to train their authorship models, leading to poor performance on out-of-domain authors. Stylometric representations are better suited to this setting, but learning them is itself an open research challenge. In this paper, we propose PART, a contrastively trained model that learns \textbf{authorship embeddings} instead of semantics. We train our model on ~1.5M texts belonging to 1162 literature authors, 17287 blog posters and 135 corporate email accounts; a heterogeneous set with identifiable writing styles. We evaluate the model on current challenges, achieving competitive performance. We also evaluate our model on test splits of the datasets, achieving zero-shot 72.39\% accuracy when bounded to 250 authors, which is 54\% and 56\% higher than RoBERTa embeddings. We qualitatively assess the representations with different data visualizations on the available datasets, observing features such as gender, age, or occupation of the author.
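As an illustration of how zero-shot attribution over a bounded set of candidate authors could be scored with fixed embeddings, the sketch below ranks candidates by cosine similarity to per-author centroids and computes top-$k$ accuracy. This is one plausible protocol, assumed here for illustration only; the abstract does not specify the exact evaluation procedure. The helper names (`centroid`, `top_k_authors`, `top_k_accuracy`) and the toy two-dimensional vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))


def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]


def top_k_authors(query, centroids, k=1):
    """Return the k candidate authors whose centroids are closest to query."""
    ranked = sorted(centroids,
                    key=lambda author: cosine(query, centroids[author]),
                    reverse=True)
    return ranked[:k]


def top_k_accuracy(queries, centroids, k=1):
    """Fraction of (embedding, true_author) queries with the author in the top k."""
    hits = sum(1 for emb, author in queries
               if author in top_k_authors(emb, centroids, k))
    return hits / len(queries)


# Toy reference embeddings for three unseen authors (hypothetical data).
reference = {
    "a": [[1.0, 0.0], [0.9, 0.1]],
    "b": [[0.0, 1.0], [0.1, 0.9]],
    "c": [[-1.0, 0.0], [-0.9, -0.1]],
}
centroids = {author: centroid(vecs) for author, vecs in reference.items()}

# Held-out texts with their true authors; the last one is deliberately ambiguous.
queries = [([0.8, 0.2], "a"), ([0.2, 0.8], "b"), ([0.6, 0.5], "b")]
top1 = top_k_accuracy(queries, centroids, k=1)
top2 = top_k_accuracy(queries, centroids, k=2)
```

No model weights are updated for the candidate authors in this scheme, which is what makes it zero-shot: only the reference embeddings change when the candidate set changes. Top-$k$ accuracy is at least as high as top-1 because the ambiguous query is still counted as a hit when the true author appears anywhere in the first $k$ ranks.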
