Estimating class separability of text embeddings with persistent homology
Kostis Gourgoulias, Najah Ghalyan, Maxime Labonne, Yash Satsangi, Sean Moran, Joseph Sabelja
TL;DR
This work tackles unsupervised estimation of the class separability of text embeddings using persistent homology, focusing on the $0$th homology group $H_0$ to track how embedding manifolds evolve during fine-tuning. The authors compute the persistence times $p_i$ from the Vietoris-Rips filtration, normalize them, and define a persistence score $S(\boldsymbol{p})$ that captures how topologically simple the embedding space becomes as training progresses. They compare this unsupervised topological signal with standard supervised separability metrics (ROC-AUC, accuracy, Thornton index) and unsupervised baselines (Calinski-Harabasz). Across binary and multi-class tasks, the persistence score tracks training improvements and plateaus similarly to the supervised metrics, especially when embeddings are normalized. The study shows that normalization fosters well-defined connected components and that the topological perspective offers a scalable, label-free lens for monitoring and improving the fine-tuning of sentence transformers in data-scarce settings. It also highlights computational considerations and avenues for future work, including higher homology groups and alternative statistics.
Abstract
This paper introduces an unsupervised method to estimate the class separability of text datasets from a topological point of view. Using persistent homology, we demonstrate how tracking the evolution of embedding manifolds during training can inform about class separability. More specifically, we show how this technique can be applied to detect when the training process stops improving the separability of the embeddings. Our results, validated across binary and multi-class text classification tasks, show that the proposed method's estimates of class separability align with those obtained from supervised methods. This approach offers a novel perspective on monitoring and improving the fine-tuning of sentence transformers for classification tasks, particularly in scenarios where labeled data is scarce. We also discuss how tracking these quantities can provide additional insights into the properties of the trained classifier.
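To make the $H_0$ persistence times of a Vietoris-Rips filtration more concrete, the following is a minimal sketch and not the authors' implementation. It relies on the standard fact that, for a Vietoris-Rips filtration, every point is born as its own connected component at scale 0 and the finite $H_0$ death times coincide with the edge lengths of a minimum spanning tree over the pairwise distances. The function names `h0_persistence_times` and `summary_score` are hypothetical, and `summary_score` is only an illustrative stand-in statistic: the definition of $S(\boldsymbol{p})$ is not given in this section, so the mean of max-normalized persistence times is an assumption.

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree
from scipy.spatial.distance import pdist, squareform


def h0_persistence_times(embeddings, normalize=True):
    """Finite H_0 persistence times of the Vietoris-Rips filtration.

    All components are born at scale 0, so each persistence time equals a
    death time, which equals one minimum-spanning-tree edge length.
    """
    X = np.asarray(embeddings, dtype=float)
    if normalize:
        # Project embeddings onto the unit sphere (L2 normalization).
        norms = np.linalg.norm(X, axis=1, keepdims=True)
        X = X / np.clip(norms, 1e-12, None)
    # Dense pairwise Euclidean distance matrix.
    dists = squareform(pdist(X, metric="euclidean"))
    # Caveat: SciPy treats zero entries as missing edges, so exact duplicate
    # points yield fewer than n - 1 persistence times in this sketch.
    mst = minimum_spanning_tree(dists)
    return np.sort(mst.data)


def summary_score(p):
    """Hypothetical stand-in for S(p): mean of max-normalized times.

    This is an assumption made for illustration, not the paper's definition.
    """
    p = np.asarray(p, dtype=float)
    if p.size == 0 or p.max() == 0:
        return 0.0
    return float(np.mean(p / p.max()))


# Toy example: two tight clusters that are far apart.
points = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]])
times = h0_persistence_times(points, normalize=False)
score = summary_score(times)
```

In the toy example, the two short persistence times correspond to merges within each cluster and the single long one to the merge between clusters. Under these assumptions, recomputing such a statistic on the embeddings after each training epoch would give a label-free curve to compare against supervised metrics. The dense distance matrix costs $O(n^2)$ memory, so subsampling may be needed for large datasets.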
