Estimating the Influence of Sequentially Correlated Literary Properties in Textual Classification: A Data-Centric Hypothesis-Testing Approach

Gideon Yoffe, Nachum Dershowitz, Ariel Vishne, Barak Sober

TL;DR

This work introduces a data-centric hypothesis-testing framework to quantify how sequentially correlated literary properties, such as theme, influence text classification. By modeling label sequences as correlated random variables and constructing a null distribution that preserves estimated autocovariances via a Toeplitz matrix, the authors test whether classification results reflect genuine stylistic signals or are confounded by thematic coherence. Across a diverse English prose corpus and multiple embedding regimes (traditional tf-idf, Delta, and STAR neural embeddings) in supervised and unsupervised settings, the method reveals that sequential correlations can inflate false positives, particularly for supervised and neural models, while unsupervised traditional features often yield robust, interpretable style signals. The approach enhances interpretability and reliability in stylometric analyses, with implications for authorship attribution and forensic linguistics, and provides a principled path to disentangle theme from style in text classification.

Abstract

We introduce a data-centric hypothesis-testing framework to quantify the influence of sequentially correlated literary properties--such as thematic continuity--on textual classification tasks. Our method models label sequences as stochastic processes and uses an empirical autocovariance matrix to generate surrogate labelings that preserve sequential dependencies. This enables statistical testing to determine whether classification outcomes are primarily driven by thematic structure or by non-sequential features like authorial style. Applying this framework across a diverse corpus of English prose, we compare traditional (word n-grams and character k-mers) and neural (contrastively trained) embeddings in both supervised and unsupervised classification settings. Crucially, our method identifies when classifications are confounded by sequentially correlated similarity, revealing that supervised and neural models are more prone to false positives--mistaking shared themes and cross-genre differences for stylistic signals. In contrast, unsupervised models using traditional features often yield high true positive rates with minimal false positives, especially in genre-consistent settings. By disentangling sequential from non-sequential influences, our approach provides a principled way to assess and interpret classification reliability. This is particularly impactful for authorship attribution, forensic linguistics, and the analysis of redacted or composite texts, where conventional methods may conflate theme with style. Our results demonstrate that controlling for sequential correlation is essential for reducing false positives and ensuring that classification outcomes reflect genuine stylistic distinctions.

Paper Structure

This paper contains 34 sections, 7 equations, 19 figures, 5 tables, 4 algorithms.

Figures (19)

  • Figure 1: Flowchart of the hypothesis-testing framework. Starting from an embedded text corpus $D \in \mathbb{R}^{m \times n}$, a binary label sequence $L \in \{0,1\}^m$ is obtained via 2-means clustering. The autocovariance vector $A$ and corresponding matrix $M$ are computed from $L$. A stochastic vector $V \sim \mathcal{N}(\vec{\bar{L}}, M)$ is drawn and thresholded to generate null sequences $L^{(\text{null})}$. Repeating this process yields a null distribution of MCC scores against $L$, from which a $p$-value is estimated.
  • Figure 2: Significance map for applying 2-means classification to 50 pairs of commingled texts by different authors (see Table A.2 in Appendix 1), embedded using word $n$-grams with $f$ = 300. The x-axis shows all feature combinations ($n$ and $l$), and the y-axis lists 50 text pairs. Colored cells indicate parameter combinations yielding classifications predominantly affected by non-sequentially correlated properties with high statistical significance, color-coded by normalized MCC score. Blank cells denote classifications where the hypothesis test yielded a $p$-value $>$ 0.05.
  • Figure 3: Extracted important features for the classification of the text pair [Dickens_Copperfield, Dickens_OliverTwist], embedded using word $n$-grams of size 2 with a text unit length of 1000.
  • Figure 4: Average distinguishing power of traditional feature configurations across all true positive cases. Each value reflects the mean and standard deviation of the MCC scores over text pairs composed by different authors. Left Panel: Character $k$-mers ($k = 2$ to $6$). Right Panel: Word $n$-grams ($n = 1$ to $4$).
  • Figure D.1: Significance map for the attempt to apply $2$-means classification to distinguish 50 pairs of texts composed by different authors (see Table A.2), embedded using character $k$-mers with $f$ = 300, similarly to Figure 2.
  • ...and 14 more figures
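
The null-generation procedure described in the Figure 1 caption can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the autocovariance uses the biased (divide by $m$) estimator so that the Toeplitz matrix $M$ is positive semidefinite, the threshold of 0.5 is an assumed choice, and the observed statistic used for the $p$-value is passed in as a parameter because the caption does not specify how it is defined.

```python
import numpy as np

def autocovariance(labels):
    """Biased sample autocovariance A[k] of a label sequence, for lags k = 0..m-1."""
    x = np.asarray(labels, dtype=float)
    m = x.size
    c = x - x.mean()
    return np.array([np.dot(c[: m - k], c[k:]) / m for k in range(m)])

def toeplitz_from(acov):
    """Symmetric Toeplitz matrix M with M[i, j] = A[|i - j|]."""
    idx = np.arange(acov.size)
    return acov[np.abs(idx[:, None] - idx[None, :])]

def mcc(a, b):
    """Matthews correlation coefficient between two binary sequences."""
    a = np.asarray(a, dtype=bool)
    b = np.asarray(b, dtype=bool)
    tp = int(np.sum(a & b))
    tn = int(np.sum(~a & ~b))
    fp = int(np.sum(~a & b))
    fn = int(np.sum(a & ~b))
    denom = np.sqrt(float((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)))
    # Convention (assumption): return 0 when one sequence is constant.
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom

def sample_null_sequences(labels, n_null, rng, threshold=0.5):
    """Draw V ~ N(mean(L) * 1, M) and threshold each draw into a binary sequence."""
    x = np.asarray(labels, dtype=float)
    M = toeplitz_from(autocovariance(x))
    # Eigendecomposition gives a square root of M; tiny negative
    # eigenvalues from round-off are clipped to zero.
    w, U = np.linalg.eigh(M)
    root = U * np.sqrt(np.clip(w, 0.0, None))
    z = rng.standard_normal((n_null, x.size))
    V = x.mean() + z @ root.T
    # Assumption: fixed threshold; the caption only says "thresholded".
    return (V > threshold).astype(int)

def empirical_p_value(null_scores, observed):
    """One-sided Monte Carlo p-value with the usual +1 correction."""
    null_scores = np.asarray(null_scores, dtype=float)
    return (1 + np.sum(null_scores >= observed)) / (1 + null_scores.size)

# Toy label sequence with long runs, mimicking sequential correlation.
rng = np.random.default_rng(0)
L = np.array([0] * 20 + [1] * 25 + [0] * 15)
nulls = sample_null_sequences(L, n_null=500, rng=rng)
null_scores = np.array([mcc(row, L) for row in nulls])
p = empirical_p_value(null_scores, observed=0.9)  # 0.9 is a placeholder value
```

Sampling through the eigendecomposition rather than a Cholesky factor is a deliberate choice in this sketch: the estimated matrix can be singular or nearly so, and the clipped square root still yields draws whose covariance matches $M$.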