Estimating the Influence of Sequentially Correlated Literary Properties in Textual Classification: A Data-Centric Hypothesis-Testing Approach
Gideon Yoffe, Nachum Dershowitz, Ariel Vishne, Barak Sober
TL;DR
This work introduces a data-centric hypothesis-testing framework to quantify how sequentially correlated literary properties, such as theme, influence text classification. By modeling label sequences as correlated random variables and constructing a null distribution that preserves estimated autocovariances via a Toeplitz matrix, the authors test whether classification results reflect genuine stylistic signals or are confounded by thematic coherence. Across a diverse English prose corpus and multiple embedding regimes (traditional tf-idf and Delta representations, and neural STAR embeddings) in supervised and unsupervised settings, the method reveals that sequential correlations can inflate false positives, particularly for supervised and neural models, while unsupervised traditional features often yield robust, interpretable style signals. The approach enhances interpretability and reliability in stylometric analyses, with implications for authorship attribution and forensic linguistics, and provides a principled path to disentangle theme from style in text classification.
Abstract
We introduce a data-centric hypothesis-testing framework to quantify the influence of sequentially correlated literary properties, such as thematic continuity, on textual classification tasks. Our method models label sequences as stochastic processes and uses an empirical autocovariance matrix to generate surrogate labelings that preserve sequential dependencies. This enables statistical testing to determine whether classification outcomes are primarily driven by thematic structure or by non-sequential features like authorial style. Applying this framework across a diverse corpus of English prose, we compare traditional (word n-grams and character k-mers) and neural (contrastively trained) embeddings in both supervised and unsupervised classification settings. Crucially, our method identifies when classifications are confounded by sequentially correlated similarity, revealing that supervised and neural models are more prone to false positives, mistaking shared themes and cross-genre differences for stylistic signals. In contrast, unsupervised models using traditional features often yield high true positive rates with minimal false positives, especially in genre-consistent settings. By disentangling sequential from non-sequential influences, our approach provides a principled way to assess and interpret classification reliability. This is particularly impactful for authorship attribution, forensic linguistics, and the analysis of redacted or composite texts, where conventional methods may conflate theme with style. Our results demonstrate that controlling for sequential correlation is essential for reducing false positives and ensuring that classification outcomes reflect genuine stylistic distinctions.
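
As an illustration of the surrogate-labeling idea, the following is a minimal sketch rather than the authors' implementation. It estimates the autocovariance of a binary label sequence, arranges it in a Toeplitz matrix, draws correlated Gaussian sequences through a Cholesky factor, and binarizes them so that each surrogate keeps the original class proportions. The binarization step, the agreement statistic, and all function names are assumptions made for this example; thresholding a latent Gaussian preserves the estimated autocovariance only approximately.

```python
import math
import random


def autocovariance(labels, max_lag):
    """Biased sample autocovariance (divide by n) for lags 0..max_lag."""
    n = len(labels)
    mean = sum(labels) / n
    centered = [x - mean for x in labels]
    return [sum(centered[t] * centered[t + k] for t in range(n - k)) / n
            for k in range(max_lag + 1)]


def toeplitz(first_column):
    """Symmetric Toeplitz matrix whose (i, j) entry is first_column[|i - j|]."""
    n = len(first_column)
    return [[first_column[abs(i - j)] for j in range(n)] for i in range(n)]


def cholesky(matrix, jitter):
    """Lower-triangular L with L L^T approximately equal to matrix + jitter * I."""
    n = len(matrix)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                # Guard against tiny negative pivots caused by round-off.
                low[i][j] = math.sqrt(max(matrix[i][i] + jitter - s, 1e-12))
            else:
                low[i][j] = (matrix[i][j] - s) / low[j][j]
    return low


def surrogate_labels(labels, n_draws, rng):
    """Draw binary surrogates with roughly the same sequential correlation.

    Assumption of this sketch: a latent Gaussian sequence with the estimated
    covariance is binarized by marking its largest values as class 1, so
    every surrogate has exactly as many 1s as the input labeling.
    """
    n = len(labels)
    acov = autocovariance(labels, n - 1)
    if acov[0] == 0:
        raise ValueError("labels must contain both classes")
    low = cholesky(toeplitz(acov), jitter=1e-6 * acov[0])
    n_ones = sum(labels)
    draws = []
    for _ in range(n_draws):
        z = [rng.gauss(0.0, 1.0) for _ in range(n)]
        latent = [sum(low[i][k] * z[k] for k in range(i + 1)) for i in range(n)]
        top = sorted(range(n), key=lambda i: latent[i], reverse=True)[:n_ones]
        surrogate = [0] * n
        for i in top:
            surrogate[i] = 1
        draws.append(surrogate)
    return draws


def agreement(predicted, labeling):
    """Hypothetical test statistic: two-class accuracy up to a label swap."""
    acc = sum(p == q for p, q in zip(predicted, labeling)) / len(labeling)
    return max(acc, 1.0 - acc)


def empirical_p_value(observed, null_scores):
    """Share of null scores at least as large as observed (add-one corrected)."""
    hits = sum(score >= observed for score in null_scores)
    return (1 + hits) / (1 + len(null_scores))


if __name__ == "__main__":
    rng = random.Random(0)
    hypothesized = [0] * 20 + [1] * 20   # toy labeling: two contiguous blocks
    predicted = list(hypothesized)
    predicted[5] = 1                     # toy classifier output with two errors
    predicted[30] = 0
    observed = agreement(predicted, hypothesized)
    null = [agreement(predicted, s)
            for s in surrogate_labels(hypothesized, 200, rng)]
    print("observed agreement:", observed)
    print("empirical p-value:", empirical_p_value(observed, null))
```

A small p-value in this toy setup would indicate that the classifier agrees with the hypothesized labeling more than it agrees with labelings that share only its sequential structure. The biased autocovariance estimator is used because its full-lag Toeplitz matrix is positive semi-definite, which keeps the Cholesky step well behaved once a small jitter is added.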
