Harvesting Textual and Contrastive Data from the HAL Publication Repository

Francis Kulumba, Wissam Antoun, Guillaume Vimont, Laurent Romary

TL;DR

Topical confounds hinder true stylometry; HALvest provides a 17B-token multilingual corpus and HALvest-Contrastive enables controlled contrastive training. The study compares unrestricted (topic-rich) and base (topic-decoupled) triplets, plus ICT and BM25 baselines, to test whether restricting topic variation yields more robust authorial signals. Key finding: neural models trained on base (topic-decoupled) triplets outperform the lexical baseline and ICT across test conditions, suggesting that topic decoupling lets models capture distributional style beyond surface tokens. The authors release the datasets and code, enabling further cross-lingual stylometry research and robust authorship attribution applications.

Abstract

Authorship attribution in natural language processing traditionally struggles to distinguish genuine stylistic signals from topical confounds. While contrastive learning approaches have addressed this by maximizing semantic overlap between positive pairs, creating large-scale datasets under strict topic constraints remains challenging. We introduce HALvest, a 17-billion-token multilingual corpus harvested from 778k open-access academic papers, and HALvest-Contrastive, a derived dataset designed to isolate stylometric signals through controlled topic variation. Unlike prior work that minimizes lexical overlap, we exploit natural topic drift between papers by the same author, treating residual lexical patterns as authorial fingerprints rather than noise. Comparing lexical baselines (BM25) against neural models trained on unrestricted (topic-rich) versus base (topic-decoupled) triplets, we demonstrate that models trained exclusively on topic-decoupled data achieve superior performance across all test conditions, outperforming both retrieval baselines and models exposed to topic-rich training data. Our analysis reveals that while lexical signals provide substantial performance gains for keyword-driven methods, neural architectures learn robust stylometric representations that plateau with moderate context length, suggesting they capture distributional style beyond surface-level tokens. Both datasets and code are publicly available.

Paper Structure (30 sections, 3 figures, 6 tables)

Figures (3)

  • Figure 1: The dataset's different configurations. From top to bottom: full text of an extracted paper from HAL, unrestricted triplets, base triplets, and inverse cloze task (ICT). Positives are in green and negatives in red. In the unrestricted configuration, positive spans can be sampled from the same document as the query.
  • Figure 2: Two fine-tuned language models, in blue and red, respectively trained on unrestricted and base data. Plain lines track performance on unrestricted test data.
  • Figure 3: Signal and noise on each test split before scaling. We split 10,000 examples by word and compute the Jaccard similarity between the query and the positive, as well as between the query and the negative. We define the signal as the difference between the query/positive and the query/negative overlaps. The noise is the Jaccard similarity between the positive and the negative of a triplet.
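
The signal and noise measures described in the Figure 3 caption can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes that "split by word" means whitespace tokenization with no case folding, and the function names `jaccard` and `signal_and_noise` are hypothetical. The scaling step mentioned in the caption is not specified there and is left out.

```python
from typing import Tuple


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity between the word sets of two texts.

    Assumption: words are obtained by whitespace splitting.
    """
    set_a, set_b = set(a.split()), set(b.split())
    union = set_a | set_b
    if not union:
        # Two empty texts: define the similarity as 0 to avoid dividing by zero.
        return 0.0
    return len(set_a & set_b) / len(union)


def signal_and_noise(query: str, positive: str, negative: str) -> Tuple[float, float]:
    """Return (signal, noise) for one triplet, following the caption's definitions.

    signal = J(query, positive) - J(query, negative)
    noise  = J(positive, negative)
    """
    signal = jaccard(query, positive) - jaccard(query, negative)
    noise = jaccard(positive, negative)
    return signal, noise


# Toy triplet (made-up strings, not taken from the dataset).
signal, noise = signal_and_noise("a b c d", "a b c e", "a x y z")
```

For the toy triplet, the query shares three of five distinct words with the positive (0.6) and one of seven with the negative, so the signal is 0.6 - 1/7 and the noise is 1/7. Averaging these per-triplet values over a sample of examples would give one signal and one noise figure per test split.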