Inference through innovation processes tested in the authorship attribution task
Giulio Tani Raffaelli, Margherita Lalli, Francesca Tria
TL;DR
This work addresses authorship attribution for symbolic sequences by leveraging urn models for innovation, specifically the urn model with triggering (UMT), which is connected to the Poisson-Dirichlet process. It introduces a direct, non-sampling inference approach (CP2D and its $\boldsymbol{\delta}$-variant) that uses a non-atomic base distribution $P_0$ and author-dependent adjustments to quantify $P(T|A)$, the likelihood that an anonymous text $T$ continues the production of a given author $A$. The method yields high accuracy across diverse corpora (literary and informal) and is more computationally efficient and transparent than traditional hierarchical Bayesian methods, with robustness to language variation and text length; it also relaxes exchangeability to accommodate time-dependent correlations. The results suggest broad applicability beyond authorship attribution and point to future integration with semantic triggering and open-set scenarios, while maintaining a strong connection to nonparametric Bayesian inference.
Abstract
Urn models for innovation capture fundamental empirical laws shared by several real-world processes. The so-called urn model with triggering includes, as particular cases, the urn representation of the two-parameter Poisson-Dirichlet process and the Dirichlet process, seminal in Bayesian non-parametric inference. In this work, we leverage this connection to introduce a general approach for quantifying closeness between symbolic sequences and test it within the framework of the authorship attribution problem. The method demonstrates high accuracy when compared to other related methods in different scenarios, featuring a substantial gain in computational efficiency and theoretical transparency. Beyond the practical convenience, this work demonstrates how the recently established connection between urn models and non-parametric Bayesian inference can pave the way for designing more efficient inference methods. In particular, the hybrid approach that we propose allows us to relax the exchangeability hypothesis, which can be particularly relevant for systems exhibiting complex correlation patterns and non-stationary dynamics.
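To make the idea of scoring how well a text continues an author's production concrete, the following is a minimal sketch based on the standard predictive rule of the two-parameter Poisson-Dirichlet process. It is not the authors' CP2D implementation: the function names, the parameters `alpha` and `theta`, the toy token sequences, and the constant `base_prob` stand-in for $P_0$ are all illustrative assumptions. Every unseen token is treated as a fresh draw from the base distribution, as a non-atomic $P_0$ implies, and the sketch assumes `theta > 0` for simplicity.

```python
import math
from collections import Counter


def log_likelihood_continuation(author_tokens, text_tokens, alpha, theta, base_prob):
    """Log-probability that text_tokens continues author_tokens under the
    two-parameter Poisson-Dirichlet predictive rule (illustrative sketch).

    alpha, theta: discount and concentration parameters (assumed names).
    base_prob: hypothetical callable giving the base weight of a new token.
    """
    if not (0.0 <= alpha < 1.0) or theta <= 0.0:
        raise ValueError("sketch assumes 0 <= alpha < 1 and theta > 0")

    counts = Counter(author_tokens)  # occurrences of each distinct token
    n = len(author_tokens)           # total tokens seen so far
    k = len(counts)                  # distinct tokens seen so far
    logp = 0.0

    for tok in text_tokens:
        c = counts.get(tok, 0)
        if c > 0:
            # Token already seen: reinforced in proportion to (count - alpha).
            p = (c - alpha) / (theta + n)
        else:
            # Novelty: innovation mass times the base weight of the token.
            p = (theta + alpha * k) / (theta + n) * base_prob(tok)
            k += 1
        logp += math.log(p)
        # The text is appended to the author's sequence as it is read.
        counts[tok] += 1
        n += 1
    return logp


def attribute(text_tokens, corpus, alpha, theta, base_prob):
    """Return the author whose sequence gives the highest log-likelihood."""
    return max(
        corpus,
        key=lambda a: log_likelihood_continuation(
            corpus[a], text_tokens, alpha, theta, base_prob
        ),
    )


# Toy data, invented for illustration only.
corpus = {
    "A": ["a", "b", "a", "b", "a"],
    "B": ["c", "d", "c", "d", "c"],
}
anonymous = ["a", "b"]
uniform_base = lambda tok: 0.01  # hypothetical constant base weight

ll_a = log_likelihood_continuation(corpus["A"], anonymous, 0.5, 1.0, uniform_base)
ll_b = log_likelihood_continuation(corpus["B"], anonymous, 0.5, 1.0, uniform_base)
best = attribute(anonymous, corpus, 0.5, 1.0, uniform_base)
```

In this toy case both tokens of the anonymous text are already in author A's sequence, so they are scored by the reinforcement term, whereas for author B they are novelties weighted by the base distribution. No sampling is involved: the score is a closed-form product of conditional probabilities, which is the sense in which such an approach can be called direct.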
