Explaining Text Similarity in Transformer Models
Alexandros Vasileiou, Oliver Eberle
TL;DR
This work addresses the challenge of explaining Transformer-based similarity models in NLP by applying BiLRP, a second-order attribution method tailored for bilinear similarity, which reveals how token interactions drive predictions. The authors formulate propagation rules for Transformers to produce faithful, relevance-conserving explanations and validate them with toy interaction tasks, perturbation analyses, and corpus-level studies across semantic, multilingual, and biomedical domains. Key findings show that BiLRP identifies task-relevant interactions more accurately than baselines, that token matching can dominate in non-finetuned settings, and that pooling choices significantly influence explanatory patterns. The study demonstrates the practical value of structured, interaction-level explanations for understanding and improving corpus-scale similarity tasks and highlights implications for safe deployment of foundation-model-based systems.
Abstract
As Transformers have become state-of-the-art models for natural language processing (NLP) tasks, the need to understand and explain their predictions is increasingly apparent. Especially in unsupervised applications, such as information retrieval tasks, similarity models built on top of foundation model representations have been widely applied. However, their inner prediction mechanisms have mostly remained opaque. Recent advances in explainable AI have made it possible to mitigate these limitations by leveraging improved explanations for Transformers through layer-wise relevance propagation (LRP). Using BiLRP, an extension developed for computing second-order explanations in bilinear similarity models, we investigate which feature interactions drive similarity in NLP models. We validate the resulting explanations and demonstrate their utility in three corpus-level use cases, analyzing grammatical interactions, multilingual semantics, and biomedical text retrieval. Our findings contribute to a deeper understanding of different semantic similarity tasks and models, highlighting how novel explainable AI methods enable in-depth analyses and corpus-level insights.
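To make the idea of a second-order, relevance-conserving explanation concrete, the following minimal sketch decomposes a similarity score into token-pair contributions. It is a toy setting only, assuming mean-pooled token embeddings and a plain dot-product similarity; it is not the Transformer propagation rules used in this work, and the embeddings and the helper names (`mean_pool`, `interaction_relevance`) are hypothetical. In this linear case, the score splits exactly into one term per token pair, so the interaction relevances sum to the similarity.

```python
# Toy illustration only: mean-pooled token embeddings with a dot-product
# similarity. This is NOT the paper's Transformer propagation scheme.

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def mean_pool(tokens):
    """Average a list of token embeddings into one sentence embedding."""
    n = len(tokens)
    dim = len(tokens[0])
    return [sum(t[d] for t in tokens) / n for d in range(dim)]

def interaction_relevance(tokens_a, tokens_b):
    """Relevance of each token pair (t, t') for the mean-pooled dot product.

    Because the similarity is bilinear in the token embeddings, it equals
    the sum over all pairs of dot(e_t, e_t') / (T * T').
    """
    scale = 1.0 / (len(tokens_a) * len(tokens_b))
    return [[scale * dot(ea, eb) for eb in tokens_b] for ea in tokens_a]

# Hypothetical 2-dimensional token embeddings for two short sentences.
sent_a = [[1.0, 0.0], [0.5, 0.5], [0.0, 2.0]]
sent_b = [[1.0, 1.0], [0.0, -1.0]]

similarity = dot(mean_pool(sent_a), mean_pool(sent_b))
relevance = interaction_relevance(sent_a, sent_b)

# Conservation check: the pairwise relevances add up to the similarity score.
total = sum(sum(row) for row in relevance)
print(similarity, total)
```

The resulting matrix has one row per token of the first sentence and one column per token of the second, which is the shape an interaction-level explanation takes. In nonlinear models such as Transformers the decomposition is no longer available in closed form, which is why dedicated propagation rules are needed.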
