From Topology to Retrieval: Decoding Embedding Spaces with Unified Signatures

Florian Rottach, William Rudman, Bastian Rieck, Harrisen Scells, Carsten Eickhoff

TL;DR

This work introduces Unified Topological Signatures (UTS), a holistic framework that aggregates diverse topological and geometric descriptors to characterize text embedding spaces across models and datasets. By computing global and local signatures, applying normalization and PCA, and using topology-informed predictors, the authors demonstrate that model family and architecture imprint distinctive topological fingerprints and that embedding space dimensionality strongly constrains retrieval performance. They show that global topology clusters by model family rather than size and that local topology can predict document retrievability and reveal bias in dense retrieval systems. The findings advocate for a multi-attribute, topology-driven view to understand and optimize embedding spaces, with practical implications for model selection, retrieval quality, and bias mitigation.
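The pipeline summarized above (compute a signature vector per embedding space, normalize across spaces, then apply PCA) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the three descriptors (participation ratio, mean cosine similarity, mean k-th nearest-neighbor distance) are simple stand-ins for the topological and geometric measures used in the paper, the function names `global_signature` and `normalize_and_project` are hypothetical, and the embedding spaces are synthetic Gaussian point clouds.

```python
import numpy as np


def global_signature(X: np.ndarray, k: int = 5) -> np.ndarray:
    """Toy signature of one embedding space X with shape (n_points, dim).

    The descriptors are illustrative stand-ins, not the paper's measures.
    """
    Xc = X - X.mean(axis=0)
    # Effective dimensionality: participation ratio of covariance eigenvalues.
    eig = np.clip(np.linalg.eigvalsh(np.cov(Xc, rowvar=False)), 0.0, None)
    participation_ratio = eig.sum() ** 2 / (eig ** 2).sum()
    # Anisotropy proxy: mean cosine similarity between distinct points.
    U = X / np.linalg.norm(X, axis=1, keepdims=True)
    cos = U @ U.T
    n = len(X)
    mean_cos = (cos.sum() - np.trace(cos)) / (n * (n - 1))
    # Local density proxy: mean distance to the k-th nearest neighbor.
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    knn_dist = np.sort(d, axis=1)[:, k].mean()  # column 0 is the self-distance
    return np.array([participation_ratio, mean_cos, knn_dist])


def normalize_and_project(S: np.ndarray, n_components: int = 2):
    """Z-score each descriptor across spaces, then project with PCA (via SVD)."""
    Z = (S - S.mean(axis=0)) / (S.std(axis=0) + 1e-12)
    _, s, Vt = np.linalg.svd(Z, full_matrices=False)
    explained = s ** 2 / (s ** 2).sum()
    return Z @ Vt[:n_components].T, explained[:n_components]


# Synthetic "embedding spaces" that differ in scale and offset (assumed data).
rng = np.random.default_rng(0)
settings = [(1.0, 0.0), (1.0, 2.0), (0.5, 0.0), (2.0, 1.0)]
spaces = [rng.normal(size=(200, 16)) * scale + shift for scale, shift in settings]

signatures = np.stack([global_signature(X) for X in spaces])
coords, explained = normalize_and_project(signatures)
print(coords.shape, explained)
```

Z-scoring before PCA keeps descriptors with large numeric ranges (here, the participation ratio) from dominating the principal components, which is one plausible reason for the normalization step mentioned above.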

Abstract

Studying how embeddings are organized in space not only enhances model interpretability but also uncovers factors that drive downstream task performance. In this paper, we present a comprehensive analysis of topological and geometric measures across a wide set of text embedding models and datasets. We find a high degree of redundancy among these measures and observe that individual metrics often fail to sufficiently differentiate embedding spaces. Building on these insights, we introduce Unified Topological Signatures (UTS), a holistic framework for characterizing embedding spaces. We show that UTS can predict model-specific properties and reveal similarities driven by model architecture. Further, we demonstrate the utility of our method by linking topological structure to ranking effectiveness and accurately predicting document retrievability. We find that a holistic, multi-attribute perspective is essential to understanding and leveraging the geometry of text embeddings.

Paper Structure

This paper contains 61 sections, 31 equations, 26 figures, 5 tables.

Figures (26)

  • Figure 1: Unified Topological Signatures (UTS) for embedding spaces. Left: We construct signature vectors for entire embedding spaces by measuring various topological descriptors. We use the vectors for downstream prediction tasks as well as for measuring representational similarity. Right: We compute local signatures based on the neighborhood of individual embeddings and use them to detect retrievability bias in document corpora.
  • Figure 2: PCA of global UTS, revealing that embedding space geometry can be summarized by only a few principal components.
  • Figure 3: Model family dominates representational similarity, with lowest distances along the diagonal and clustering by architectures.
  • Figure 4: Feature importance for retrieval performance prediction model across all folds.
  • Figure 5: UMAP visualization of local space for 100 highly retrievable and 100 non-retrievable documents on the QuoraRetrieval dataset.
  • ...and 21 more figures