Interpretable Text Embeddings and Text Similarity Explanation: A Survey
Juri Opitz, Lucas Möller, Andrianos Michail, Sebastian Padó, Simon Clematide
TL;DR
This survey addresses the interpretability and explainability of text embeddings and their pairwise similarities, a critical yet underexplored area given the practical and regulatory demand for transparent AI. It presents a taxonomy that divides methods into inherently interpretable embeddings (space shaping, sparsity, structured objects, and set-based representations) and post-hoc explanations (interaction attribution, global explainability, and surrogate modeling). Key contributions include a detailed synthesis of ideas, concrete examples (e.g., QA-based features, box embeddings, ColBERT-style token alignments), evaluation considerations, and a discussion of trade-offs and open challenges. The work highlights how interpretable approaches can be transferred to modern decoder-based embedding models and underscores the importance of context-aware, multi-faceted explanations for trustworthy deployment across domains and languages.
Abstract
Text embeddings are a fundamental component of many NLP tasks, including classification, regression, clustering, and semantic search. Despite their ubiquity, interpreting embeddings and explaining the similarities between them remains challenging. In this work, we provide a structured overview of methods for inherently interpretable text embeddings and text similarity explanation, an underexplored research area. We characterize the main ideas, approaches, and trade-offs. We compare means of evaluation, discuss overarching lessons learned, and finally identify opportunities and open challenges for future research.
