Harvesting Textual and Contrastive Data from the HAL Publication Repository
Francis Kulumba, Wissam Antoun, Guillaume Vimont, Laurent Romary
TL;DR
Topical confounds make it hard to isolate genuine stylistic signals in authorship attribution. HALvest provides a 17B-token multilingual corpus, and HALvest-Contrastive enables controlled contrastive training. The study compares neural models trained on unrestricted (topic-rich) and base (topic-decoupled) triplets against ICT and BM25 baselines to test whether restricting topic variation yields more robust authorial signals. Key finding: neural models trained on base (topic-decoupled) triplets outperform both the lexical baselines and ICT across test conditions, suggesting that topic decoupling lets models capture distributional style beyond surface tokens. The authors release the datasets and code, enabling further cross-lingual stylometry research and more robust authorship attribution.
Abstract
Authorship attribution in natural language processing traditionally struggles to distinguish genuine stylistic signals from topical confounds. While contrastive learning approaches have addressed this by maximizing semantic overlap between positive pairs, creating large-scale datasets under strict topic constraints remains challenging. We introduce HALvest, a 17-billion-token multilingual corpus harvested from 778k open-access academic papers, and HALvest-Contrastive, a derived dataset designed to isolate stylometric signals through controlled topic variation. Unlike prior work that minimizes lexical overlap, we exploit natural topic drift between papers by the same author, treating residual lexical patterns as authorial fingerprints rather than noise. Comparing lexical baselines (BM25) against neural models trained on unrestricted (topic-rich) versus base (topic-decoupled) triplets, we demonstrate that models trained exclusively on topic-decoupled data achieve superior performance across all test conditions, outperforming both retrieval baselines and models exposed to topic-rich training data. Our analysis reveals that while lexical signals provide substantial performance gains for keyword-driven methods, neural architectures learn robust stylometric representations that plateau with moderate context length, suggesting they capture distributional style beyond surface-level tokens. Both datasets and code are publicly available.
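To make the triplet-based contrastive setup concrete, the following is a minimal sketch of a margin-based triplet objective over embedding vectors. It is an illustration under stated assumptions, not the authors' training code: the abstract does not specify the loss, so the cosine distance, the margin value of 0.5, and the function names `cosine_distance` and `triplet_margin_loss` are all hypothetical choices made here for illustration.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def triplet_margin_loss(anchor, positive, negative, margin=0.5):
    """Hinge loss that is zero once the positive is closer to the anchor
    than the negative by at least `margin` (an assumed hyperparameter)."""
    d_pos = cosine_distance(anchor, positive)
    d_neg = cosine_distance(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)


# Toy 2-d "embeddings" (hypothetical values, not real model outputs).
anchor = [1.0, 0.0]    # text by author A
positive = [1.0, 0.0]  # another text by author A
negative = [0.0, 1.0]  # text by a different author

# Well-separated triplet: the loss vanishes.
loss_easy = triplet_margin_loss(anchor, positive, negative)

# Swapping positive and negative simulates a violated triplet.
loss_hard = triplet_margin_loss(anchor, negative, positive)
```

In this framing, the difference between topic-rich and topic-decoupled training lies in how the positive and negative texts are sampled, not in the form of the loss itself.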
