Explaining Text Similarity in Transformer Models

Alexandros Vasileiou; Oliver Eberle

TL;DR

This work addresses the challenge of explaining Transformer-based similarity models in NLP by introducing BiLRP, a second-order attribution method tailored for bilinear similarity, which reveals how token interactions drive predictions. The authors formulate propagation rules for Transformers to produce faithful, relevance-conserving explanations and validate them with toy interaction tasks, perturbation analyses, and corpus-level studies across semantic, multilingual, and biomedical domains. Key findings show that BiLRP identifies task-relevant interactions more accurately than baselines, that token matching can dominate in non-finetuned settings, and that pooling choices significantly influence explanatory patterns. The study demonstrates the practical value of structured, interaction-level explanations for understanding and improving corpus-scale similarity tasks and highlights implications for the safe deployment of foundation-model-based systems.

Abstract

As Transformers have become state-of-the-art models for natural language processing (NLP) tasks, the need to understand and explain their predictions is increasingly apparent. Especially in unsupervised applications, such as information retrieval tasks, similarity models built on top of foundation model representations have been widely applied. However, their inner prediction mechanisms have mostly remained opaque. Recent advances in explainable AI have made it possible to mitigate these limitations by leveraging improved explanations for Transformers through layer-wise relevance propagation (LRP). Using BiLRP, an extension developed for computing second-order explanations in bilinear similarity models, we investigate which feature interactions drive similarity in NLP models. We validate the resulting explanations and demonstrate their utility in three corpus-level use cases, analyzing grammatical interactions, multilingual semantics, and biomedical text retrieval. Our findings contribute to a deeper understanding of different semantic similarity tasks and models, highlighting how novel explainable AI methods enable in-depth analyses and corpus-level insights.
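To make the idea of a second-order explanation concrete, consider a deliberately simplified case that is not the Transformer setting studied in the paper: a linear feature map phi(x) = Wx with dot-product similarity y = <phi(x), phi(x')>. Under that assumption, the interaction score between feature i of x and feature i' of x' can be written as x_i (W^T W)_{ii'} x'_{i'}, and the scores sum exactly to y, which is the conservation property mentioned above. The sketch below uses NumPy; the function name `linear_interactions` and the random data are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def linear_interactions(W, x, x_prime):
    """Second-order interaction scores for y = <W x, W x'> (toy linear case)."""
    # For a linear feature map, the mixed second derivative of y
    # with respect to x and x' is the constant matrix W^T W.
    M = W.T @ W
    # Entry (i, i') is x_i * M[i, i'] * x'_i'.
    return x[:, None] * M * x_prime[None, :]

# Hypothetical toy data: 6 input features mapped to 4 dimensions.
rng = np.random.default_rng(0)
W = rng.normal(size=(4, 6))
x = rng.normal(size=6)
x_prime = rng.normal(size=6)

R = linear_interactions(W, x, x_prime)
y = float((W @ x) @ (W @ x_prime))

# Conservation check: the interaction scores add up to the similarity score.
conserved = bool(np.isclose(R.sum(), y))
```

In this linear case the scores coincide with a Hessian-times-product attribution; for deep nonlinear models the two generally differ, which is where dedicated propagation rules come in.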

Paper Structure (33 sections, 3 equations, 9 figures, 5 tables)

Introduction
Related Work
Explainable AI for Similarity Models
Explainable AI for Similarity Models
Explainable AI for Transformers
Propagation rules
Experiments
Data
Similarity Models
Transformers
Pooling
Evaluation of Explanations
Interaction Analysis
Perturbation Analysis
Conservation
...and 18 more sections

Figures (9)

Figure 1: Comparison of different explanation techniques that highlight the interaction between input features. Ground truth interactions (top row) are the interactions between same noun tokens. These are compared to second-order explanations built on top of BERT token embeddings, Hessian$\times$Product (H$\times$P) and BiLRP. Average cosine similarity (ACS) is used to measure agreement between ground truth and explanations.
Figure 2: Perturbation experiment comparing different explanation methods across models. Fractions of tokens, ranked from most to least relevant, are added to one input sequence and the resulting Euclidean distance to the unperturbed sentence is measured. A steep initial decline with a smaller area under the curve indicates better identification of task-relevant features.
Figure 3: Corpus-level analysis of BiLRP explanations between POS tags on the STSb dataset. The contribution of positive/negative interactions to the similarity score is shown in red/blue for three similarity models, ranging from (a) the least predictive (BERT + CLS), to (b) moderately predictive (BERT + Mean Pooling), to (c) the most predictive (SBERT) (cf. Table \ref{table:stsb_scores}).
Figure 4: Comparison of mono- and multilingual BERT-based similarity models on mSTSb. (a) Spearman correlation $\rho \times 100$ on the multilingual STSb corpus. Similarity models are built from monolingual (mono) and multilingual (multi) BERT models that receive monolingual input, and a multilingual model that receives mixed input consisting of one sentence in English and a translated version of the other sentence (mix-multi). (b) BiLRP explanations on mBERT for English-English (left), German-German (center) and English-German (right). The sentence pair is assigned a true similarity score of 0.85. (c) Comparison of positively relevant POS interactions. POS tags are selected based on the largest difference in accumulated relevance between the mixed and the monolingual settings.
Figure 5: Analysis of semantic similarity on the BIOSSES dataset containing biomedical text. Mean squared error (MSE) between predicted and true similarity for the SGPT, SBERT and BERT similarity models is shown (top). The top-5 most relevant token interactions are shown for high and low similarity levels (bottom).
...and 4 more figures
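
The two evaluation measures named in the captions of Figures 1 and 2 can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' evaluation code: ACS is taken here as the cosine similarity between flattened ground-truth and explanation matrices, averaged over sentence pairs, and the perturbation curve starts from an empty sequence and adds tokens from most to least relevant. The helper `mean_pool` is a hypothetical stand-in for a real sentence encoder, and the relevance scores are made up.

```python
import numpy as np

def average_cosine_similarity(ground_truths, explanations):
    """Mean cosine similarity between flattened interaction matrices."""
    scores = []
    for g, e in zip(ground_truths, explanations):
        g, e = np.ravel(g), np.ravel(e)
        denom = np.linalg.norm(g) * np.linalg.norm(e)
        scores.append(float(g @ e / denom) if denom > 0 else 0.0)
    return float(np.mean(scores))

def perturbation_curve(token_vecs, relevance, embed):
    """Add tokens from most to least relevant; track distance to the full input."""
    order = np.argsort(-relevance)
    target = embed(token_vecs)
    kept = np.zeros(len(token_vecs), dtype=bool)
    distances = [np.linalg.norm(target - embed(token_vecs[kept]))]
    for idx in order:
        kept[idx] = True  # the boolean mask keeps tokens in their original order
        distances.append(np.linalg.norm(target - embed(token_vecs[kept])))
    return np.array(distances)

def mean_pool(vecs):
    # Hypothetical stand-in for a sentence encoder; empty input maps to zeros.
    return vecs.mean(axis=0) if len(vecs) else np.zeros(vecs.shape[1])

# ACS is invariant to rescaling an explanation.
gt = np.eye(3)
acs_same = average_cosine_similarity([gt], [2.0 * gt])

# Toy perturbation run with made-up token vectors and relevance scores.
rng = np.random.default_rng(0)
tokens = rng.normal(size=(5, 8))
relevance = np.linalg.norm(tokens, axis=1)  # hypothetical relevance scores
curve = perturbation_curve(tokens, relevance, mean_pool)

# Area under the curve via the trapezoidal rule over the fraction of tokens added.
fractions = np.linspace(0.0, 1.0, len(curve))
auc = float(np.sum((curve[1:] + curve[:-1]) / 2 * np.diff(fractions)))
```

A smaller `auc` means the ranking recovers the unperturbed representation with fewer tokens, which is the reading suggested by the Figure 2 caption.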
