A Method for Handling Negative Similarities in Explainable Graph Spectral Clustering of Text Documents -- Extended Version
Mieczysław A. Kłopotek, Sławomir T. Wierzchoń, Bartłomiej Starosta, Dariusz Czerski, Piotr Borkowski
TL;DR
The paper addresses the challenge of negative cosine similarities in graph spectral clustering (GSC) of text documents, which arise with embeddings such as GloVe. It analyzes and proposes similarity transformations (including s_{ik}^{(pN)}, s_{ik}^{(pD)}, and related schemes) that preserve the clustering objectives while keeping the Laplacians computable. Experiments on tweet data with GloVe and Term Vector Space (TVS) embeddings show that zeroing negatives often harms normalized L-GSC, whereas carefully chosen transforms improve performance for both combinatorial and normalized Laplacians and maintain explainability. The findings extend the applicability of GSC to modern text embeddings and motivate future work with additional embeddings and longer texts.
Abstract
This paper investigates the problem of Graph Spectral Clustering (GSC) with negative similarities, which arise from document embeddings other than the traditional Term Vector Space, such as doc2vec and GloVe. Solutions for combinatorial Laplacians and normalized Laplacians are discussed. An experimental investigation shows the advantages and disadvantages of six different solutions proposed in the literature and in this research. The research demonstrates that GloVe embeddings frequently cause failures of normalized-Laplacian-based GSC due to negative similarities. Furthermore, applying methods that cure similarity negativity improves accuracy for both combinatorial and normalized-Laplacian-based GSC. It also makes explanation methods originally developed by the authors for Term Vector Space embeddings applicable to GloVe embeddings.
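To make the failure mode concrete, the following is a minimal sketch (not the authors' code) of why negative cosine similarities can break the normalized Laplacian: the node degrees may become non-positive, so D^{-1/2} is undefined. The function names, the toy vectors, and the affine shift (s + 1) / 2 are assumptions introduced for illustration only; the shift is a generic remedy and is not claimed to be one of the paper's transformations such as s_{ik}^{(pN)} or s_{ik}^{(pD)}, whose definitions are not given in this section.

```python
import numpy as np

def cosine_similarity_matrix(X):
    """Pairwise cosine similarities of the rows of X, with a zero diagonal."""
    U = X / np.linalg.norm(X, axis=1, keepdims=True)
    S = U @ U.T
    np.fill_diagonal(S, 0.0)
    return S

def normalized_laplacian(S):
    """I - D^{-1/2} S D^{-1/2}; raises if any degree is non-positive."""
    d = S.sum(axis=1)
    if np.any(d <= 0):
        raise ValueError("non-positive degree: D^{-1/2} is undefined")
    inv_sqrt = 1.0 / np.sqrt(d)
    return np.eye(len(d)) - inv_sqrt[:, None] * S * inv_sqrt[None, :]

def zero_negatives(S):
    """Replace negative similarities with zero."""
    return np.maximum(S, 0.0)

def shift_to_unit_interval(S):
    """Illustrative assumption: map similarities from [-1, 1] to [0, 1]."""
    T = (S + 1.0) / 2.0
    np.fill_diagonal(T, 0.0)
    return T

# Toy "embeddings" (hypothetical): the first two documents point in opposite directions.
X = np.array([[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0]])
S = cosine_similarity_matrix(X)

# The shifted similarities are non-negative, so the Laplacian can be formed.
L_shifted = normalized_laplacian(shift_to_unit_interval(S))
```

In this toy case both the raw matrix `S` and `zero_negatives(S)` make `normalized_laplacian` raise, because every degree is zero or negative, while the shifted matrix yields a valid symmetric Laplacian. Real data will rarely be this extreme; the example only shows the mechanism.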
