From Small to Large Language Models: Revisiting the Federalist Papers

So Won Jeong, Veronika Ročková

TL;DR

This study evaluates whether off-the-shelf Large Language Model (LLM) embeddings improve authorship attribution on the Federalist Papers relative to traditional small language models. It systematically compares bag-of-words (BoW) topic embeddings (via LDA) with continuous LLM embeddings (from BERT, RoBERTa, GPT, and Llama), using LASSO and BART classifiers, and analyzes classification thresholds chosen by ROC and F1 criteria. The main finding is that LDA-based topic embeddings paired with a Bayesian classifier (BART) achieve the best out-of-sample accuracy, while larger, generic embeddings often underperform because they capture semantic content rather than stylistic markers such as function words. The results reinforce the value of traditional stylometric methods for interpretable attribution and suggest practical guidelines for combining LLMs with established statistical models in nuanced authorship tasks. Overall, the work highlights dimension-reduction approaches as robust, interpretable, and competitive alternatives to large-scale embeddings in targeted text classification problems.
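
As a rough illustration of what thresholding by ROC and F1 criteria can look like, the sketch below picks a cutoff on predicted probabilities either by maximizing F1 or by maximizing Youden's J (true positive rate minus false positive rate), one common ROC-based rule. The function names, the toy probabilities, and the choice of Youden's J are assumptions for illustration; the summary above does not specify the exact procedure used in the paper.

```python
from typing import List, Tuple


def confusion(probs: List[float], labels: List[int], threshold: float) -> Tuple[int, int, int, int]:
    """Count (tp, fp, fn, tn) when predicting class 1 for probs >= threshold."""
    tp = fp = fn = tn = 0
    for p, y in zip(probs, labels):
        pred = 1 if p >= threshold else 0
        if pred == 1 and y == 1:
            tp += 1
        elif pred == 1 and y == 0:
            fp += 1
        elif pred == 0 and y == 1:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def f1(tp: int, fp: int, fn: int) -> float:
    """F1 score written directly in terms of confusion counts."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def youden_j(tp: int, fp: int, fn: int, tn: int) -> float:
    """Youden's J = TPR - FPR, an assumed stand-in for a ROC-based criterion."""
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return tpr - fpr


def best_threshold(probs: List[float], labels: List[int], criterion: str = "f1") -> Tuple[float, float]:
    """Scan the observed probabilities as candidate cutoffs; return (cutoff, score)."""
    best_t, best_score = None, float("-inf")
    for t in sorted(set(probs)):
        tp, fp, fn, tn = confusion(probs, labels, t)
        score = f1(tp, fp, fn) if criterion == "f1" else youden_j(tp, fp, fn, tn)
        if score > best_score:  # strict comparison keeps the lowest best cutoff
            best_t, best_score = t, score
    return best_t, best_score


# Toy, made-up predicted probabilities; label 1 is a hypothetical "positive" author.
probs = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2]
labels = [1, 1, 1, 0, 0, 0]
f1_cut = best_threshold(probs, labels, "f1")
roc_cut = best_threshold(probs, labels, "roc")
```

On well-separated probabilities such as the toy data here, both criteria select the same cutoff; they can disagree when the classes overlap or are imbalanced.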

Abstract

For a long time, the authorship of the Federalist Papers had been a subject of inquiry and debate, not only by linguists and historians but also by statisticians. In what was arguably the first Bayesian case study, Mosteller and Wallace (1963) provided the first statistical evidence for attributing all disputed papers to Madison. Our paper revisits this historical dataset through the lens of modern language models, both small and large. We review some of the more popular Large Language Model (LLM) tools and examine them from a statistical point of view in the context of text classification. We investigate whether, without any attempt to fine-tune, the general embedding constructs can be useful for stylometry and attribution. We explain the differences between various word/phrase embeddings and discuss how to aggregate them into a document-level representation. Contrary to our expectations, we show by example that dimension expansion with word embeddings may not always be beneficial for attribution relative to dimension reduction with topic embeddings. Our experiments demonstrate that default LLM embeddings (even after manual fine-tuning) may not consistently improve authorship attribution accuracy. Instead, Bayesian analysis with topic embeddings trained on "function words" yields superior out-of-sample classification performance. This suggests that traditional (small) statistical language models, with their interpretability and solid theoretical foundation, can offer significant advantages in authorship attribution tasks. The code used in this analysis is available at github.com/sowonjeong/slm-to-llm.
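
To make the two kinds of document representation concrete, here is a minimal sketch, not the authors' implementation. It builds a function-word count vector for a document (the kind of bag-of-words input a topic model such as LDA could be trained on) and averages per-token vectors into a single document vector, which is one simple way to aggregate word embeddings. The short word list, the helper names, and the use of mean pooling are assumptions for illustration only.

```python
import re
from collections import Counter
from typing import List, Sequence

# Hypothetical, tiny word list for illustration; it is not the paper's curated list.
FUNCTION_WORDS = ["upon", "whilst", "on", "by", "to"]


def function_word_counts(text: str, vocab: Sequence[str] = FUNCTION_WORDS) -> List[int]:
    """Return the count of each vocabulary word in the text (case-insensitive)."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(tokens)
    return [counts[w] for w in vocab]


def mean_pool(token_vectors: Sequence[Sequence[float]]) -> List[float]:
    """Average per-token embedding vectors into a single document vector."""
    if not token_vectors:
        raise ValueError("need at least one token vector")
    n = len(token_vectors)
    dim = len(token_vectors[0])
    return [sum(v[j] for v in token_vectors) / n for j in range(dim)]


# A made-up sentence, not a quotation from the Federalist Papers.
doc = "Upon reflection, whilst we rely upon the states to act."
bow = function_word_counts(doc)

# Made-up 2-dimensional token embeddings standing in for LLM outputs.
doc_vec = mean_pool([[1.0, 2.0], [3.0, 4.0]])
```

The count vector has one entry per function word regardless of document length, whereas the pooled vector has the dimension of the underlying embedding model, which is typically much larger.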

Paper Structure

This paper contains 44 sections, 27 equations, 15 figures, 23 tables.

Figures (15)

  • Figure 1: BART classification probability based on document embeddings. The red density is the kernel density estimate of the BART predicted probabilities for papers authored by Hamilton, and the blue density is the corresponding estimate for papers authored by Madison. The predicted probabilities of the disputed papers are shown as green vertical lines. For LDA, the results use the word counts of "function words" as input. Well-separated densities for Hamilton and Madison indicate less uncertainty in the predictions for the disputed papers.
  • Figure 2: Stylized representation of the relationships among language models. Each model learns a function $f: \mathbf{W}_{[1:T]} \to \mathbb{R}^p$, mapping a word sequence to a latent representation. LDA (red) is a probabilistic model, while LSA and NMF (orange, yellow) use numerical methods like matrix factorization. Word2Vec and GloVe (green) introduce shallow neural networks, followed by recurrent models (RNN, LSTM) and transformer-based models (green), which capture long-range dependencies. Modern deep learning methods (blue) rely on large neural networks with autoencoding and autoregressive objectives. This spectrum illustrates the shift from probabilistic and numerical methods to neural architectures.
  • Figure 3: The estimated density for Hamilton and Madison using BART with BERT embeddings is shown. While the fine-tuned embeddings yield perfectly separated density estimates for the training data, the unseen documents (indicated by green vertical lines) do not fall within the estimated density regions but instead lie in the intermediate space. This suggests overfitting during the fine-tuning process.
  • Figure 4: Document-topic distribution from LDA trained on 145 selected words (Type 3). The similarity between the topic distributions of the Madison-authored papers and the disputed papers suggests a shared style among them.
  • Figure 5: Word cloud representation of significant words found by LASSO with different sets of word count matrices. Type 1 includes contextual words only, Type 2 includes both contextual words and stopwords, and Type 3 includes the set of words curated by Mosteller and Wallace (1963). LASSO successfully recovers some of the words reported in the original study (Table \ref{tab:function-words}). The most discriminative words, such as 'whilst' and 'upon', are consistently recovered with all three types of input.
  • ...and 10 more figures