Estimating class separability of text embeddings with persistent homology

Kostis Gourgoulias, Najah Ghalyan, Maxime Labonne, Yash Satsangi, Sean Moran, Joseph Sabelja

TL;DR

This work tackles unsupervised estimation of text-embedding class separability by leveraging persistent homology, focusing on the $0$th homology group $H_0$ to track the evolution of embedding manifolds during fine-tuning. By computing the persistence times $p_i$ from the Vietoris-Rips filtration and normalizing them, the authors define a persistence score $S(\boldsymbol{p})$ that captures how topologically simple the embedding space becomes as training progresses. They compare this unsupervised topological signal with standard supervised separability metrics (ROC-AUC, accuracy, Thornton index) and unsupervised baselines (Calinski-Harabasz), showing that the persistence score tracks training improvements and plateaus similarly to supervised metrics, across binary and multi-class tasks, especially when embeddings are normalized. The study demonstrates that normalization fosters well-defined connected components and that the topological perspective offers a scalable, label-free lens for monitoring and improving fine-tuning of sentence transformers in data-scarce settings, while also highlighting computational considerations and avenues for future work, including higher homology and alternative statistics.

Abstract

This paper introduces an unsupervised method to estimate the class separability of text datasets from a topological point of view. Using persistent homology, we demonstrate how tracking the evolution of embedding manifolds during training can inform about class separability. More specifically, we show how this technique can be applied to detect when the training process stops improving the separability of the embeddings. Our results, validated across binary and multi-class text classification tasks, show that the proposed method's estimates of class separability align with those obtained from supervised methods. This approach offers a novel perspective on monitoring and improving the fine-tuning of sentence transformers for classification tasks, particularly in scenarios where labeled data is scarce. We also discuss how tracking these quantities can provide additional insights into the properties of the trained classifier.
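
As a rough illustration of the quantities involved, the sketch below computes $H_0$ persistence times for a point cloud. It relies on a standard fact: in a Vietoris-Rips filtration every point is born at scale zero, and the finite $H_0$ death times coincide with the edge lengths of a Euclidean minimum spanning tree (up to the factor of two that separates radius and diameter conventions). The function names and the choice to normalize by the largest persistence time are assumptions made for illustration; the summary above does not specify the authors' exact normalization or the form of $S(\boldsymbol{p})$, so the score itself is not reproduced here.

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree
from scipy.spatial.distance import pdist, squareform


def h0_persistence_times(points: np.ndarray) -> np.ndarray:
    """Finite H0 death times of a Vietoris-Rips filtration, sorted ascending.

    These equal the edge lengths of a Euclidean minimum spanning tree.
    Caveat: SciPy treats zero entries as missing edges, so exact duplicate
    points should be removed before calling this helper.
    """
    dist = squareform(pdist(points))   # dense (n, n) pairwise distances
    mst = minimum_spanning_tree(dist)  # sparse matrix holding n - 1 edges
    return np.sort(mst.data)


def normalized_persistence_times(points: np.ndarray) -> np.ndarray:
    """Rescale persistence times to [0, 1].

    Dividing by the maximum is an assumption, chosen because it makes the
    result invariant to uniform scaling of the point cloud.
    """
    times = h0_persistence_times(points)
    return times / times.max()


# Four collinear points with gaps 1, 2 and 4 between neighbours.
toy_cloud = np.array([[0.0, 0.0], [1.0, 0.0], [3.0, 0.0], [7.0, 0.0]])
toy_times = normalized_persistence_times(toy_cloud)
print(toy_times)
```

Building the full distance matrix costs memory quadratic in the number of points, so for large tracking sets a subsample or a dedicated persistent-homology library would likely be preferable.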

Paper Structure (15 sections, 1 equation, 7 figures)

This paper contains 15 sections, 1 equation, 7 figures.

Figures (7)

  • Figure 1: Evolution of the densities of persistence times for a moon dataset (sampled with "sklearn.datasets.make_moons") with different noise parameters. As noise decreases, the points in each cluster approach each other, making most (but not all!) persistence times small (see Equation \ref{eq:vr-def}) and increasing the confidence that a small number of connected components exists.
  • Figure 2: Separability metrics for the binary classification text example. Every line tracks the mean over all models (all-MiniLM-L6-v2, paraphrase-TinyBERT-L6-v2, and sentence-transformers/paraphrase-albert-small-v2) and datasets (the train splits of SetFit/amazon_counterfactual and SetFit/sst2 datasets from Hugging Face). We use seven splits per dataset/model combination. Accuracy, Thornton, and our persistence-score exhibit similar behavior. Additionally, the increase in the persistence score indicates that the embeddings are organized into more well-defined components. Intervals are 95% confidence bands.
  • Figure 3: A comparison of the evolution of the normalized persistence times of the tracking set of the dataset SetFit/amazon_counterfactual for the TinyBERT model during fine-tuning. On the right plot, TinyBERT includes an additional normalization step that ensures the embeddings have a norm equal to one. Otherwise, the setup of the two plots is the same (same splits, optimizer configuration, and weight initialization for the ST). The point cloud is organized during fine-tuning so that the connected components are well-separated on the unit ball (compare with Figure \ref{fig:moon-deformation-persistent-hom}). On the left plot, no normalization takes place and there are more persistence times close to one, indicating a different organization of the embedding space that nevertheless also achieves an acceptable balanced accuracy ($0.79$ vs $0.88$ for the normalized case in this example). Recall that the persistence scores are normalized, so they are invariant to point-cloud scaling.
  • Figure 4: The behavior of the various separability metrics on the embeddings as epochs progress. Each line includes a 95% confidence interval and is composed of the corresponding metric over seven splits and the three models: all-MiniLM-L6-v2, paraphrase-TinyBERT-L6-v2, and sentence-transformers/paraphrase-albert-small-v2. The persistence score mimics the behavior of the supervised metrics, including the upward arc starting at epoch 4 and subsequent slowdown. The CH score continues to increase linearly throughout the epochs after epoch 4.
  • Figure 5: Another example of the behavior of the metrics through training; the setup is the same as for Figure \ref{fig:multiclass-emotion-all-models-all-splits}. In this example, separability saturates after epoch 1 for the supervised metrics and after epoch 2 for the persistence score. The CH score never truly saturates, indicating that the clusters continue to become more separated in the embedding space, yet this additional separation makes little difference to the rest of the metrics.
  • ...and 2 more figures
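
For readers unfamiliar with the comparison metrics named in the captions above, the following sketch shows one common way to compute the Thornton index (the fraction of points whose nearest neighbour carries the same label) and the Calinski-Harabasz (CH) score. It uses the two-moons toy data from Figure 1 rather than text embeddings. The helper names, noise levels, and sample size are hypothetical, and passing the true labels to the CH score is an assumption; the paper may compute its baselines differently.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.metrics import calinski_harabasz_score
from sklearn.neighbors import NearestNeighbors


def thornton_index(points: np.ndarray, labels: np.ndarray) -> float:
    """Fraction of points whose nearest neighbour shares their label."""
    labels = np.asarray(labels)
    # Ask for two neighbours: column 0 is the query point itself
    # (assuming no exact duplicates), column 1 its nearest other point.
    nn = NearestNeighbors(n_neighbors=2).fit(points)
    _, idx = nn.kneighbors(points)
    return float(np.mean(labels[idx[:, 1]] == labels))


def separability_report(noise: float, n_samples: int = 400, seed: int = 0) -> dict:
    """Thornton index and CH score for a two-moons sample at a given noise level."""
    X, y = make_moons(n_samples=n_samples, noise=noise, random_state=seed)
    return {
        "thornton": thornton_index(X, y),
        # Assumption: CH is evaluated against the true class labels.
        "ch": float(calinski_harabasz_score(X, y)),
    }


clean = separability_report(noise=0.05)  # well-separated moons
noisy = separability_report(noise=0.5)   # heavily overlapping moons
print("low noise :", clean)
print("high noise:", noisy)
```

Both quantities should drop as noise increases and the two classes overlap. The Thornton index is bounded in $[0, 1]$, whereas the CH score is unbounded, which fits the observation in the captions that CH can keep growing after the other metrics have plateaued.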