CDC: A Simple Framework for Complex Data Clustering

Zhao Kang, Xuanting Xie, Bingheng Li, Erlin Pan

TL;DR

CDC tackles the challenge of clustering complex data by unifying graph filtering with adaptive anchor learning under a similarity-preserving regularizer, enabling linear-time clustering across single-view, multi-view, graph, and non-graph data. The framework learns a small set of high-quality anchors and a consensus anchor graph to enable efficient spectral clustering via SVD on Z followed by K-means, with learnable view weights improving integration. Theoretical guarantees show that filtering preserves both topology and attribute similarity and that the learned anchor graph is clustering-friendly. Empirically, CDC achieves strong results across 14 datasets, scales to 111M nodes, and often outperforms state-of-the-art GNN-based methods while offering favorable runtime trade-offs, highlighting its practical potential for large-scale, heterogeneous clustering tasks.
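
The pipeline summarized above (filter, build an anchor graph $Z$, take an SVD of $Z$, run K-means) can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the filter form $(I - L/2)^k$ is assumed, anchors are taken at caller-supplied indices instead of being learned, the anchor graph is a plain ridge-regression fit without the similarity-preserving regularizer, and all function names and parameters (`low_pass_filter`, `anchor_graph`, `cdc_like_clustering`, `alpha`, `anchor_idx`) are hypothetical.

```python
import numpy as np

def low_pass_filter(A, X, k=2):
    """Assumed filter: H = (I - L/2)^k X with L = I - D^{-1/2} A D^{-1/2}."""
    d = A.sum(axis=1)
    d[d == 0] = 1.0                        # guard isolated nodes against division by zero
    A_norm = A / np.sqrt(np.outer(d, d))   # symmetrically normalized adjacency
    S = 0.5 * (np.eye(A.shape[0]) + A_norm)  # equals I - L/2
    H = X.astype(float)
    for _ in range(k):
        H = S @ H
    return H

def anchor_graph(H, anchor_idx, alpha=1.0):
    """Stand-in for anchor learning: fixed anchors B = H[anchor_idx] and
    Z = argmin ||H - Z B||_F^2 + alpha ||Z||_F^2 (closed-form ridge solution)."""
    B = H[anchor_idx]                            # (m, d) anchor features
    G = B @ B.T + alpha * np.eye(len(anchor_idx))
    return np.linalg.solve(G, B @ H.T).T         # (n, m) anchor graph

def kmeans(P, c, iters=50):
    """Plain Lloyd iterations with deterministic farthest-point initialization."""
    centers = [P[0]]
    for _ in range(1, c):
        d2 = np.min([((P - ctr) ** 2).sum(axis=1) for ctr in centers], axis=0)
        centers.append(P[np.argmax(d2)])
    centers = np.array(centers, dtype=float)
    for _ in range(iters):
        dist = ((P[:, None, :] - centers[None]) ** 2).sum(axis=2)
        labels = np.argmin(dist, axis=1)
        for j in range(c):
            if np.any(labels == j):
                centers[j] = P[labels == j].mean(axis=0)
    return labels

def cdc_like_clustering(A, X, c, anchor_idx, k=2, alpha=1.0):
    H = low_pass_filter(A, X, k)
    Z = anchor_graph(H, anchor_idx, alpha)
    # SVD of the n x m matrix Z costs O(n m^2), linear in n for fixed m.
    U, _, _ = np.linalg.svd(Z, full_matrices=False)
    P = U[:, :c]
    P = P / np.maximum(np.linalg.norm(P, axis=1, keepdims=True), 1e-12)
    return kmeans(P, c)

# Toy usage: two disjoint 10-node cliques with distinct one-hot attributes.
n = 20
A = np.zeros((n, n))
A[:10, :10] = 1.0
A[10:, 10:] = 1.0
np.fill_diagonal(A, 0.0)
X = np.zeros((n, 2))
X[:10, 0] = 1.0
X[10:, 1] = 1.0
labels = cdc_like_clustering(A, X, c=2, anchor_idx=np.array([0, 5, 10, 15]))
print(labels)
```

Working with the tall matrix $Z$ instead of a full $n \times n$ similarity graph is what keeps the spectral step linear in the number of nodes; the fixed anchors here only mimic that shape and do not reproduce the quality of learned anchors.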

Abstract

In today's data-driven digital era, both the amount and the complexity of collected data (multi-view, non-Euclidean, multi-relational) are growing exponentially or even faster. Clustering, which extracts valid knowledge from data without supervision, is extremely useful in practice. However, existing methods are developed independently, each handling one particular challenge at the expense of the others. In this work, we propose a simple but effective framework for complex data clustering (CDC) that can efficiently process different types of data with linear complexity. We first utilize graph filtering to fuse geometric structure and attribute information. We then reduce the complexity with high-quality anchors that are adaptively learned via a novel similarity-preserving regularizer. We demonstrate the clusterability of our proposed method both theoretically and experimentally. In particular, we deploy CDC on graph data with 111M nodes.

Paper Structure (29 sections, 2 theorems, 11 equations, 5 figures, 11 tables)

Key Result

Theorem 4.2

Define the distance between filtered nodes $i$ and $j$ as $\|h_i-h_j\|^2$. Then $\|h_i-h_j\|^2\leq \frac{1}{2^{2k}} [\|(A_i-A_j)\sum_{i=0}^{k-1}{i \choose N}A^iX\|^2+\|x_i-x_j\|^2]$, i.e., the filtered features $H$ preserve both topology and attribute similarity.
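
A small numerical check can make the bound more concrete. The sketch below assumes the filter $H = \left(\frac{I + A}{2}\right)^k X$ with $A$ the symmetrically normalized adjacency, a form suggested by the $\frac{1}{2^{2k}}$ factor but not stated in this summary. It considers the special case of two nodes with identical adjacency rows ($A_i = A_j$), where the topology term vanishes and only the attribute term $\frac{1}{2^{2k}}\|x_i-x_j\|^2$ remains; under the assumed filter the filtered distance matches that term exactly. The graph, features, and function name are illustrative choices.

```python
import numpy as np

def low_pass_filter(A, X, k):
    """Assumed filter: H = ((I + A_norm) / 2)^k X."""
    d = A.sum(axis=1)
    d[d == 0] = 1.0
    A_norm = A / np.sqrt(np.outer(d, d))
    S = 0.5 * (np.eye(len(A)) + A_norm)
    return np.linalg.matrix_power(S, k) @ X

# 4-cycle in which nodes 0 and 1 share the same neighbors {2, 3}
# and are not adjacent to each other, so their adjacency rows coincide.
A = np.zeros((4, 4))
for u, v in [(0, 2), (0, 3), (1, 2), (1, 3)]:
    A[u, v] = A[v, u] = 1.0

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))   # arbitrary node attributes
k = 3

H = low_pass_filter(A, X, k)
filtered_dist = np.sum((H[0] - H[1]) ** 2)
attribute_term = np.sum((X[0] - X[1]) ** 2) / 2 ** (2 * k)

# With A_0 == A_1 the two quantities agree up to floating-point error.
print(filtered_dist, attribute_term)
```

For nodes whose neighborhoods differ, the topology term is nonzero and the filtered distance additionally reflects how differently the two nodes are connected.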

Figures (5)

  • Figure 1: Visualization of learned graph $Z$'s grouping effect.
  • Figure 2: Run time of existing SOTA methods on various datasets.
  • Figure 3: Results on ACM and Pubmed with different anchor number $m$.
  • Figure 4: Accuracy on ACM and Pubmed with different $\alpha$ and $\beta$.
  • Figure 5: The objective function value of CDC.

Theorems & Definitions (5)

  • Definition 4.1: Grouping effect [GEKDD]
  • Theorem 4.2
  • proof
  • Theorem 4.3
  • proof