Inference through innovation processes tested in the authorship attribution task

Giulio Tani Raffaelli, Margherita Lalli, Francesca Tria

TL;DR

This work addresses authorship attribution for symbolic sequences by leveraging urn models for innovation, specifically the urn model with triggering (UMT), which is connected to the Poisson-Dirichlet process. It introduces a direct, non-sampling inference approach (CP2D and its $\boldsymbol{\delta}$-variant) that uses a non-atomic base distribution $P_0$ and author-dependent adjustments to quantify the likelihood that an anonymous text $T$ continues a given author's production, $P(T|A)$. The method yields high accuracy across diverse corpora (literary and informal) and offers greater computational efficiency and transparency than traditional hierarchical Bayesian methods, with robustness to language variation and text length; it also relaxes exchangeability to accommodate time-dependent correlations. The results suggest broad applicability beyond authorship attribution and point to future integration with semantic triggering and open-set scenarios, while maintaining a strong nonparametric Bayesian connection.

Abstract

Urn models for innovation capture fundamental empirical laws shared by several real-world processes. The so-called urn model with triggering includes, as particular cases, the urn representation of the two-parameter Poisson-Dirichlet process and the Dirichlet process, seminal in Bayesian non-parametric inference. In this work, we leverage this connection to introduce a general approach for quantifying closeness between symbolic sequences and test it within the framework of the authorship attribution problem. The method demonstrates high accuracy when compared to other related methods in different scenarios, featuring a substantial gain in computational efficiency and theoretical transparency. Beyond the practical convenience, this work demonstrates how the recently established connection between urn models and non-parametric Bayesian inference can pave the way for designing more efficient inference methods. In particular, the hybrid approach that we propose allows us to relax the exchangeability hypothesis, which can be particularly relevant for systems exhibiting complex correlation patterns and non-stationary dynamics.
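
To make the idea of "closeness between symbolic sequences" concrete, the following minimal sketch scores how plausibly a text continues an author's token sequence under the standard predictive rule of the two-parameter Poisson-Dirichlet process. It is an illustration under stated assumptions, not the CP2D method itself: the function names, the default values of `alpha` and `theta`, and the constant `log_p0` standing in for the base-distribution weight of a previously unseen token are all hypothetical choices made for this example.

```python
import math
from collections import Counter


def log_prob_continuation(author_tokens, text_tokens,
                          alpha=0.5, theta=1.0, log_p0=math.log(1e-6)):
    """Log-probability that text_tokens continues author_tokens.

    Uses the two-parameter Poisson-Dirichlet predictive rule:
      seen token with count c : (c - alpha) / (theta + n)
      unseen token            : (theta + alpha * k) / (theta + n) * P0
    where n is the number of tokens so far and k the number of distinct ones.
    log_p0 is an assumed constant placeholder for log P0 of a new token.
    """
    counts = Counter(author_tokens)
    n = len(author_tokens)   # tokens observed so far
    k = len(counts)          # distinct tokens observed so far
    logp = 0.0
    for tok in text_tokens:
        c = counts[tok]      # Counter returns 0 for unseen tokens
        if c > 0:
            logp += math.log((c - alpha) / (theta + n))
        else:
            logp += math.log((theta + alpha * k) / (theta + n)) + log_p0
            k += 1
        # The text itself updates the counts: it "continues" the sequence.
        counts[tok] += 1
        n += 1
    return logp


def attribute(corpus, text_tokens, **params):
    """Return the author whose sequence gives the text the highest score."""
    return max(corpus,
               key=lambda a: log_prob_continuation(corpus[a], text_tokens, **params))


corpus = {
    "A": "the cat sat on the mat".split(),
    "B": "a dog ran in a park".split(),
}
print(attribute(corpus, "the cat sat".split()))
```

Because each conditional probability is computed in closed form from running counts, the score needs a single pass over the text and no sampling, which is the kind of efficiency gain the abstract refers to.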

Paper Structure (17 sections, 6 equations, 2 figures, 1 table)

This paper contains 17 sections, 6 equations, 2 figures, 1 table.

Figures (2)

  • Figure 1: Corpora sizes and the impact of model parameters on attribution accuracy. Panel (a) gives a pictorial view of various characteristics related to the size of the considered corpora. The size of the triangles is proportional to the logarithm of the corpus size, measured as the number of documents. For each corpus, the $x$ axis shows the distribution of the number of texts per author and the $y$ axis the distribution of the number of characters per author. Specifically, the solid bars represent the interquartile range of the distributions and the dotted lines show the 95% interval, to highlight their long tails. Panels (b) to (f) report the attribution accuracy as the fragment length and the $\delta$ value vary. The colour scale refers to the difference relative to the maximum attribution accuracy obtained in each dataset. In the upper band, each fragment is a single token. In the lower band, the text is not partitioned into fragments (Full Text).
  • Figure 2: Attribution accuracy. For each of the considered datasets and attribution methods, thick lines show the average accuracy in the ten-fold stratified cross-validation experiment, while shaded circles refer to the attribution accuracy on each of the ten test sets separately. An exception is the E-mail dataset, where a single test set is considered (see main text). We compare the accuracy achieved by our method (the Constrained Probability 2-parameters Poisson-Dirichlet, in both its versions, without and with the parameter $\delta$: the CP2D and $\delta$-CP2D) with the Cross-Entropy based approach (CE), the Latent Dirichlet Allocation plus Hellinger distance (LDA-H), the Disjoint Author-Document Topic model in its Probabilistic formulation (DADT-P), and the Topic Drift Model (TDM). On the literary corpora, the LDA-H accuracy is computed using our implementation; please refer to Supplementary Methods for details. For the informal corpora, the results are available from a previous study [seroussi2012authorship]. Results for the DADT-P and the TDM algorithms were available in the works by Seroussi et al. [seroussi2014authorship] and Yang et al. [yang2017authorship], respectively.
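
The ten-fold stratified cross-validation mentioned in the Figure 2 caption can be sketched as follows. This is a generic illustration, not the authors' evaluation code: `stratified_folds`, `cross_validated_accuracy`, and the `predict(train_indices, test_index)` callable are hypothetical names introduced here, and the round-robin assignment is just one simple way to keep each author's share roughly equal across folds.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k=10, seed=0):
    """Split document indices into k folds, spreading each author evenly."""
    rng = random.Random(seed)
    by_author = defaultdict(list)
    for idx, author in enumerate(labels):
        by_author[author].append(idx)
    folds = [[] for _ in range(k)]
    offset = 0  # rotate the starting fold so small authors do not pile up
    for author in sorted(by_author):
        docs = by_author[author]
        rng.shuffle(docs)
        for j, idx in enumerate(docs):
            folds[(offset + j) % k].append(idx)
        offset += len(docs)
    return folds


def cross_validated_accuracy(labels, predict, k=10, seed=0):
    """Return (mean accuracy, per-fold accuracies) over non-empty folds.

    predict(train_indices, test_index) is an assumed callable that returns
    the predicted author label for one held-out document.
    """
    per_fold = []
    for test in stratified_folds(labels, k, seed):
        if not test:
            continue
        held_out = set(test)
        train = [i for i in range(len(labels)) if i not in held_out]
        hits = sum(predict(train, i) == labels[i] for i in test)
        per_fold.append(hits / len(test))
    return sum(per_fold) / len(per_fold), per_fold


# Toy usage: a baseline that always predicts the most frequent training author.
labels = ["a"] * 20 + ["b"] * 10

def majority(train, _test_index):
    train_labels = [labels[i] for i in train]
    return max(set(train_labels), key=train_labels.count)

mean_acc, fold_acc = cross_validated_accuracy(labels, majority, k=10)
```

The per-fold values correspond to the shaded circles in the figure and their mean to the thick line.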