Latent Space Interpretation for Stylistic Analysis and Explainable Authorship Attribution

Milad Alshomary, Narutatsu Ri, Marianna Apidianaki, Ajay Patel, Smaranda Muresan, Kathleen McKeown

TL;DR

The paper tackles the opacity of latent embeddings in state-of-the-art authorship attribution systems and proposes a bottom-up latent-space interpretation by clustering author embeddings into a small set of representative points that are mapped to distributions over writing-style features generated by LLMs. It demonstrates strong alignment with the original latent space (Pearson $r=0.79$), provides human-validated style descriptions (72% preference), and shows an average $+20\%$ accuracy improvement on the AA task when explanations are available. The approach combines clustering, large-language-model–derived style descriptors, and user studies to establish both the plausibility of the explanations and their practical utility for improving explainability and human performance in authorship attribution. Overall, it offers a scalable, interpretable framework for understanding and validating latent representations in AA models, with potential impact on forensic linguistics and real-world document authorship analysis.

Abstract

Recent state-of-the-art authorship attribution methods learn authorship representations of texts in a latent, non-interpretable space, hindering their usability in real-world applications. Our work proposes a novel approach to interpreting these learned embeddings by identifying representative points in the latent space and utilizing LLMs to generate informative natural language descriptions of the writing style of each point. We evaluate the alignment of our interpretable space with the latent one and find that it achieves the best prediction agreement compared to other baselines. Additionally, we conduct a human evaluation to assess the quality of these style descriptions, validating their utility as explanations for the latent space. Finally, we investigate whether human performance on the challenging AA task improves when aided by our system's explanations, finding an average improvement of around +20% in accuracy.

Paper Structure

This paper contains 40 sections, 1 equation, 11 figures, 5 tables.

Figures (11)

  • Figure 1: Our approach for explaining authorship attribution predictions. We identify $k$ clusters with centroids $p_1, \dots, p_k$ in the embedding space and associate each with writing style features. The writing style of a document $D_i$ is explained by aggregating the style features of its closest cluster.
  • Figure 2: Our approach: Given the training corpus $D^{\text{train}}$ with documents from authors $A_1, \dots, A_m$, we generate style descriptions for each document to construct the style corpus. We then identify relevant regions in the latent space $p_1, \dots, p_k$ by clustering author-level representations and aggregate style features to obtain style representations for each region.
  • Figure 3: Performance comparison of cluster assignments by number of clusters. Smaller EER and larger AP indicate better performance. Note that both metrics naturally favor assignments with more clusters.
  • Figure 4: Refinement steps applied to the style descriptions distilled from llama3-8b.
  • Figure 5: Illustration of style features extracted from the training corpus using our approach. 1470 features were extracted for our dataset.
  • ...and 6 more figures
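
The pipeline described in the captions of Figures 1 and 2 can be illustrated with a small sketch. This is not the authors' implementation: the captions do not name the clustering algorithm, so plain k-means stands in for it here, and the two-dimensional embeddings, the author names, the style features, and the helper functions `kmeans`, `nearest`, `style_distributions`, and `explain` are all hypothetical. The sketch clusters author-level vectors into $k$ centroids, turns the style features of each cluster's authors into a normalized distribution, and explains a new document by the top features of its closest centroid.

```python
import math
from collections import Counter


def nearest(vec, centroids):
    """Index of the centroid closest to vec (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda j: math.dist(vec, centroids[j]))


def kmeans(points, k, iters=20):
    """Plain Lloyd's k-means (an assumed stand-in for the paper's clustering).

    The first k points seed the centroids, which is only reasonable for toy data.
    """
    centroids = [list(p) for p in points[:k]]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(p, centroids)].append(p)
        for j, group in enumerate(groups):
            if group:  # keep the old centroid if a cluster is empty
                centroids[j] = [sum(col) / len(group) for col in zip(*group)]
    return centroids


def style_distributions(author_vecs, author_feats, centroids):
    """Aggregate each cluster's style features into a normalized distribution."""
    counts = [Counter() for _ in centroids]
    for author, vec in author_vecs.items():
        counts[nearest(vec, centroids)].update(author_feats[author])
    dists = []
    for c in counts:
        total = sum(c.values())
        dists.append({f: n / total for f, n in c.items()} if total else {})
    return dists


def explain(doc_vec, centroids, dists, top=2):
    """Return the closest cluster and its highest-weighted style features."""
    j = nearest(doc_vec, centroids)
    return j, sorted(dists[j], key=dists[j].get, reverse=True)[:top]


# Hypothetical author-level embeddings (2-D for readability) and style features.
author_vecs = {
    "A1": (0.0, 0.1),
    "A2": (5.0, 5.1),
    "A3": (0.1, 0.0),
    "A4": (5.1, 5.0),
}
author_feats = {
    "A1": ["formal tone", "long sentences"],
    "A2": ["frequent emojis", "short sentences"],
    "A3": ["formal tone"],
    "A4": ["frequent emojis"],
}

centroids = kmeans(list(author_vecs.values()), k=2)
dists = style_distributions(author_vecs, author_feats, centroids)
cluster, features = explain((4.8, 5.2), centroids, dists)
print(cluster, features)
```

In this toy setup the explanation is just a ranked list of feature names. In the setting the captions describe, the features would be the LLM-generated style descriptions and the vectors would come from the authorship attribution model.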