Interpretable Text Embeddings and Text Similarity Explanation: A Survey

Juri Opitz, Lucas Möller, Andrianos Michail, Sebastian Padó, Simon Clematide

TL;DR

This survey addresses the interpretability and explainability of text embeddings and their pairwise similarities, a critical yet under-explored area given the practical and regulatory demand for transparent AI. It presents a taxonomy that divides methods into inherently interpretable embeddings (space shaping, sparsity, structured objects, and set-based representations) and post-hoc explanations (interaction attribution, global explainability, and surrogate modeling). Key contributions include a detailed synthesis of ideas, concrete examples (e.g., QA-based features, box embeddings, ColBERT-style token alignments), evaluation considerations, and a discussion of trade-offs and open challenges. The work highlights how various interpretable approaches can be transferred to modern decoder-based embedding models and underscores the importance of context-aware, multi-faceted explanations for trustworthy deployment across domains and languages.

Abstract

Text embeddings are a fundamental component in many NLP tasks, including classification, regression, clustering, and semantic search. However, despite their ubiquitous application, challenges persist in interpreting embeddings and explaining similarities between them. In this work, we provide a structured overview of methods specializing in inherently interpretable text embeddings and text similarity explanation, an underexplored research area. We characterize the main ideas, approaches, and trade-offs. We compare means of evaluation, discuss overarching lessons learned and finally identify opportunities and open challenges for future research.

Paper Structure

This paper contains 55 sections, 4 figures, 1 table.

Figures (4)

  • Figure 1: A schema of a standard text encoder architecture with the different interpretable embeddings and explainability approaches, each corresponding to subsections in the text.
  • Figure 2: In S3BERT space decomposition, an overall $sim$=0.76 for the sentence pair "Two men are singing" and "Three men are singing" emerges from aggregating per-aspect similarities. (Simplified aspect set used here.)
  • Figure 3: An example of a late-interaction matrix between query and passage token embeddings in the ColBERTv2.0 model. The overall $sim$ is 0.965. Red boxes indicate row-wise maxima (alignment).
  • Figure 4: Interaction attributions between two sentences computed with the IJ method. The $sim$ is 0.618 and the measurable attribution error is 0.001.
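
The late-interaction matrix and row-wise maxima described in the Figure 3 caption can be illustrated with a small sketch. This is not the ColBERTv2.0 implementation: the function names, the toy two-dimensional token vectors, and the choice to average the row-wise maxima (rather than sum them) are assumptions made for illustration, and the sketch assumes that no token vector is all zeros.

```python
import math

def normalize(vec):
    # Scale a vector to unit length so dot products become cosine similarities.
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def late_interaction(query_tokens, passage_tokens):
    """Toy late-interaction scoring over token embeddings (illustrative only)."""
    q = [normalize(v) for v in query_tokens]
    p = [normalize(v) for v in passage_tokens]
    # Token-by-token cosine similarity matrix: rows are query tokens,
    # columns are passage tokens.
    matrix = [[sum(a * b for a, b in zip(qv, pv)) for pv in p] for qv in q]
    # Row-wise argmax: the passage token each query token aligns to.
    alignment = [max(range(len(row)), key=row.__getitem__) for row in matrix]
    # Assumption: average the row-wise maxima so the score stays in [-1, 1].
    score = sum(row[j] for row, j in zip(matrix, alignment)) / len(matrix)
    return matrix, alignment, score

# Hypothetical token embeddings, not taken from any real model.
query = [[1.0, 0.0], [0.0, 1.0]]
passage = [[0.0, 2.0], [3.0, 0.0], [1.0, 1.0]]
matrix, alignment, score = late_interaction(query, passage)
```

Here `alignment` plays the role of the red boxes in the figure: it records, for each query token, which passage token contributes its maximum similarity, which is what makes the overall score traceable to token pairs.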