Separating Style from Substance: Enhancing Cross-Genre Authorship Attribution through Data Selection and Presentation
Steven Fincke, Elizabeth Boschee
TL;DR
The paper tackles cross-genre authorship attribution with Sadiri, a RoBERTa-based encoder trained with a supervised contrastive loss and a data curriculum that emphasizes stylistic cues over topical similarity. Its core techniques are hard-positive mining (same-author pairs that are topically distant) and hard-negative batching (batches clustered so that they contain harder negatives), both intended to deter topic leakage. On the HIATUS HiRS dataset, the approach yields a 62.7% relative improvement in the cross-genre setting and 16.6% in the per-genre setting, with gains amplified by multi-genre training. The work identifies data selection and presentation as the main drivers of cross-genre stylometry and offers a practical path to robust authorship attribution across genres without relying on topic-specific features.
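As a rough illustration of hard-positive mining, the sketch below picks, for each author, the pair of their documents that is least similar in topic. It is a minimal sketch under stated assumptions, not the paper's implementation: the function name `pick_hard_positives`, the use of precomputed topic vectors, and cosine similarity as the topical-distance measure are all hypothetical choices made for this example.

```python
import numpy as np

def pick_hard_positives(topic_vecs, author_ids):
    """For each author with at least two documents, return the index pair
    of their two documents whose topic vectors have the lowest cosine
    similarity. Topic vectors are assumed inputs (hypothetical)."""
    topic_vecs = np.asarray(topic_vecs, dtype=float)
    # Normalize rows so that dot products are cosine similarities.
    norms = np.linalg.norm(topic_vecs, axis=1, keepdims=True)
    unit = topic_vecs / np.clip(norms, 1e-12, None)

    pairs = {}
    for author in sorted(set(author_ids)):
        idx = [i for i, a in enumerate(author_ids) if a == author]
        if len(idx) < 2:
            continue  # no same-author pair available
        sims = unit[idx] @ unit[idx].T
        # Mask the diagonal so a document is never paired with itself.
        np.fill_diagonal(sims, np.inf)
        r, c = np.unravel_index(np.argmin(sims), sims.shape)
        pairs[author] = tuple(sorted((idx[r], idx[c])))
    return pairs

# Toy data: author "a" wrote two near-identical-topic documents and one
# on a different topic; author "b" wrote two documents.
toy_vecs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [1.0, 0.0], [1.0, 0.05]]
toy_authors = ["a", "a", "a", "b", "b"]
hard_pairs = pick_hard_positives(toy_vecs, toy_authors)
```

Training on such pairs gives the encoder same-author examples where shared topic cannot explain the match, which is the intuition behind emphasizing style over topic.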
Abstract
The task of deciding whether two documents are written by the same author is challenging for both machines and humans. This task is even more challenging when the two documents are written about different topics (e.g. baseball vs. politics) or in different genres (e.g. a blog post vs. an academic article). For machines, the problem is complicated by the relative lack of real-world training examples that cross the topic boundary and the extreme scarcity of cross-genre data. We propose targeted methods for training data selection and a novel learning curriculum that are designed to discourage a model's reliance on topic information for authorship attribution and correspondingly force it to incorporate information more robustly indicative of style no matter the topic. These refinements yield a 62.7% relative improvement in average cross-genre authorship attribution, as well as 16.6% in the per-genre condition.
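The reported figures are relative, not absolute, improvements. The sketch below shows how a relative improvement is conventionally computed; the helper name `relative_improvement` and the scores passed to it are placeholders chosen for illustration and are not results from the paper.

```python
def relative_improvement(baseline, improved):
    """Return the gain of `improved` over `baseline` as a fraction of
    the baseline score."""
    if baseline <= 0:
        raise ValueError("baseline score must be positive")
    return (improved - baseline) / baseline

# Placeholder scores (not from the paper): going from 0.40 to 0.50 is a
# 0.10 absolute gain but a 25% relative gain.
example_gain = relative_improvement(0.40, 0.50)
```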
