Incremental Graph Construction Enables Robust Spectral Clustering of Texts

Marko Pranjić; Boshko Koloski; Nada Lavrač; Senja Pollak; Marko Robnik-Šikonja

TL;DR

This work introduces a simple incremental $k$-NN graph construction that preserves connectivity by design: each new node is linked to its $k$ nearest previously inserted nodes, which guarantees a connected graph for any $k$.

Abstract

Constructing the neighborhood graph is a critical but often fragile step in spectral clustering of text embeddings. On realistic text datasets, standard $k$-NN graphs can contain many disconnected components at practical sparsity levels (small $k$), making spectral clustering degenerate and sensitive to hyperparameters. We introduce a simple incremental $k$-NN graph construction that preserves connectivity by design: each new node is linked to its $k$ nearest previously inserted nodes, which guarantees a connected graph for any $k$. We provide an inductive proof of connectedness and discuss implications for incremental updates when new documents arrive. We validate the approach on spectral clustering of SentenceTransformer embeddings using Laplacian eigenmaps across six clustering datasets from the Massive Text Embedding Benchmark. Compared to standard $k$-NN graphs, our method performs better in the low-$k$ regime, where disconnected components are prevalent, and matches standard $k$-NN at larger $k$.
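The construction described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `incremental_knn_graph` and `count_components`, the use of cosine distance, the insertion order (row order of the input), and the random test data are all assumptions made for illustration. The sketch links each new node to its `min(k, i)` nearest previously inserted nodes and then counts connected components. Since every node after the first attaches to at least one earlier node, the graph should stay connected even for `k = 1`, which mirrors the inductive argument mentioned above.

```python
import numpy as np


def incremental_knn_graph(X, k):
    """Symmetric boolean adjacency matrix of an incremental k-NN graph.

    Assumptions (not from the paper): nodes are inserted in row order and
    neighbors are ranked by cosine distance.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    # Normalize rows so that a dot product equals cosine similarity.
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    Xn = X / np.clip(norms, 1e-12, None)
    adj = np.zeros((n, n), dtype=bool)
    for i in range(1, n):
        # Cosine distance from node i to every previously inserted node.
        dist = 1.0 - Xn[:i] @ Xn[i]
        # Early nodes have fewer than k predecessors, so take min(k, i).
        nearest = np.argsort(dist)[: min(k, i)]
        adj[i, nearest] = True
        adj[nearest, i] = True
    return adj


def count_components(adj):
    """Count connected components with an iterative depth-first search."""
    n = adj.shape[0]
    seen = np.zeros(n, dtype=bool)
    components = 0
    for start in range(n):
        if seen[start]:
            continue
        components += 1
        seen[start] = True
        stack = [start]
        while stack:
            u = stack.pop()
            for v in np.flatnonzero(adj[u]):
                if not seen[v]:
                    seen[v] = True
                    stack.append(v)
    return components


# Hypothetical stand-in for document embeddings.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 16))
adj = incremental_knn_graph(X, k=1)
n_components = count_components(adj)
```

With `k = 1` the resulting graph is a spanning tree over the inserted nodes, the sparsest case in which connectivity can hold. In this sketch the result depends on the insertion order, which is left as the row order of `X`.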

Paper Structure (17 sections, 1 equation, 6 figures, 7 tables, 1 algorithm)

Introduction
Background and Related work
Demonstrating disconnected components on realistic data
Cosine distance $\epsilon$-neighborhood graph
Cosine-distance based $k$-nearest neighbor graph
Incremental $k$-NN neighborhood graph
Experimental setting
Datasets
Document representation
Metrics
Results
Ablation study
The text embedding model
Adding MST to the incrementally built graph
Comparing graph properties
...and 2 more sections

Figures (6)

Figure 1: Proposed methodology. Top: General spectral clustering pipeline. After embedding the documents, a graph is constructed, projected into eigenspace, and clustered using k-means. Bottom: Comparison of a standard nearest-neighbor graph (which may be disconnected) with the proposed incremental approach that progressively links nodes to form a connected graph. The light blue triangle denotes the search for nearest neighbors among previously considered points, while the yellow circle represents the search for global nearest neighbors.
Figure 2: Clustering performance of embeddings induced using incremental $k$-NN (Ours) and standard $k$-NN over a range of values of $k$. Markers indicate the values of $k$ at which the standard $k$-NN graph is connected; the graph built by Ours is always connected. Results for sentence-to-sentence (S2S) tasks are shown at the top and for paragraph-to-paragraph (P2P) tasks at the bottom.
Figure 3: The impact of different embedding models applied to different datasets.
Figure 4: The Nemenyi-Friedman test with critical distance of multiple embedding models across all datasets and values of $k$.
Figure 5: Correlation of the graph-level and document-level properties and the V-measure.
...and 1 more figures

Theorems & Definitions (1)

proof
