Residualized Similarity for Faithfully Explainable Authorship Verification

Peter Zeng, Pegah Alipoormolabashi, Jihu Mun, Gourab Dey, Nikita Soni, Niranjan Balasubramanian, Owen Rambow, H. Schwartz

TL;DR

The paper tackles the need for explainable Authorship Verification (AV) by blending interpretable, text-derived Gram2vec features with a neural residual predictor. It introduces Residualized Similarity (RS), where a neural model learns the residual between an interpretable similarity score and the ground truth, producing a final score that balances accuracy with interpretability. The authors define interpretability confidence to quantify how much a prediction relies on the interpretable features, and demonstrate that RS achieves competitive or superior AV performance across four diverse datasets while providing faithful explanations linked to textual features. This approach has practical implications for forensic linguistics and responsible AI in author attribution, enabling verifiable and traceable decisions.

Abstract

Responsible use of Authorship Verification (AV) systems requires not only high accuracy but also interpretable solutions. More importantly, for a system to be used in decisions with real-world consequences, its predictions must be explainable using interpretable features that can be traced to the original texts. Neural methods achieve high accuracy, but their representations lack direct interpretability. Furthermore, LLM predictions cannot be explained faithfully: if an explanation is given for a prediction, it does not represent the reasoning process behind that prediction. In this paper, we introduce Residualized Similarity (RS), a novel method that supplements systems using interpretable features with a neural network to improve their performance while maintaining interpretability. Authorship verification is fundamentally a similarity task, where the goal is to measure how alike two documents are. The key idea is to use the neural network to predict a similarity residual, i.e., the error in the similarity predicted by the interpretable system. Our evaluation across four datasets shows that we can not only match the performance of state-of-the-art authorship verification models but also show how and to what degree the final prediction is faithful and interpretable.
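
The residual idea in the abstract can be written compactly. The notation here is an assumption introduced for illustration and is not taken from the paper: s_int is the similarity produced by the interpretable system for a document pair (d_1, d_2), y is the ground-truth similarity target for that pair, and f_theta is the neural residual predictor.

```latex
% Residual the neural model is trained to predict (error of the interpretable system)
r(d_1, d_2) = y - s_{\mathrm{int}}(d_1, d_2)

% Final residualized similarity: interpretable score plus predicted residual
\hat{s}(d_1, d_2) = s_{\mathrm{int}}(d_1, d_2) + f_\theta(d_1, d_2)
```

Under this reading, a predicted residual near zero means the final score is almost entirely the interpretable similarity, while a large residual means the neural correction dominates.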

Paper Structure

This paper contains 34 sections, 5 equations, 4 figures, 5 tables.

Figures (4)

  • Figure 1: Demonstration of the task of Authorship Verification. A forensic linguist is trying to determine if two texts share the same author. They may use either an interpretable system comprising linguistic features faithful to the source text or a neural model, which has good performance but lacks interpretability. Our system combines the relative strengths of both by using a neural model to correct the error in the interpretable system’s prediction.
  • Figure 2: Residualized Similarity Architecture. To incorporate signal from the interpretable feature vectors, we add an attention layer over both the interpretable feature vectors and the neural embeddings from the model we're fine-tuning. Boxes colored in green indicate components that are updated during training. On the left-hand side, we show the system in use at inference time. The final similarity score is a simple sum of the interpretable cosine similarity score and the predicted residual.
  • Figure 3: The distribution of interpretability confidence scores in the predictions using residualized similarity with LUAR on the Reddit dataset.
  • Figure 4: Example Pairs for Case Study. Pair 1 is by two different authors, and Pair 2 is by the same author.
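
The inference-time sum described in the Figure 2 caption can be sketched as follows. This is a minimal illustration, not the authors' implementation: the feature vectors stand in for Gram2vec outputs, the function names are hypothetical, and the predicted residual is passed in as a plain number rather than produced by a fine-tuned model.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two feature vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        # Assumption: treat an all-zero vector as having no similarity signal.
        return 0.0
    return dot / (norm_u * norm_v)


def residualized_similarity(
    features_a: Sequence[float],
    features_b: Sequence[float],
    predicted_residual: float,
) -> float:
    """Interpretable cosine similarity plus the neural model's predicted residual."""
    interpretable_score = cosine_similarity(features_a, features_b)
    return interpretable_score + predicted_residual


# Hypothetical interpretable feature vectors for two documents.
doc_a = [1.0, 0.0]
doc_b = [0.0, 1.0]

# Hypothetical residual; in the described system a neural model would predict it.
final_score = residualized_similarity(doc_a, doc_b, predicted_residual=-0.5)
```

Because the final score is a plain sum, the interpretable term can always be inspected on its own, and the size of the residual shows how far the neural correction moved the prediction.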