A Graph-Partitioning Based Continuous Optimization Approach to Semi-supervised Clustering Problems

Wei Liu, Xin Liu, Michael K. Ng, Zaikun Zhang

TL;DR

This work addresses semi-supervised clustering without requiring an exact number of clusters by recasting clustering as a graph-partitioning problem with must-link constraints. It introduces COSSC, a continuous optimization model that employs a low-rank surrogate via a matrix $H$ and an edge-removal variable $Z$, with must-link information encoded through a weight-augmented matrix $\bar{A}$ and an overestimated cluster count $d$. A block coordinate descent algorithm alternates between updating $Z$ (via a linear program) and $H$ (via a constrained eigensolution), with finite-time convergence guarantees and theoretical conditions (notably $\beta p>2$) ensuring must-link satisfaction. Empirical results on synthetic graphs and the TDT2 document dataset show that COSSC achieves superior ACC and NMI while maintaining zero RMV and favorable CPU times, highlighting its practical robustness and efficiency for graph-based semi-supervised clustering.

Abstract

Semi-supervised clustering is a basic problem in various applications. Most existing methods require knowledge of the ideal cluster number, which is often difficult to obtain in practice. Besides, satisfying the must-link constraints is another major challenge for these methods. In this work, we view the semi-supervised clustering task as a partitioning problem on a graph associated with the given dataset, where the similarity matrix includes a scaling parameter to reflect the must-link constraints. Utilizing a relaxation technique, we formulate the graph partitioning problem into a continuous optimization model that does not require the exact cluster number, but only an overestimate of it. We then propose a block coordinate descent algorithm to efficiently solve this model, and establish its convergence result. Based on the obtained solution, we can construct the clusters that theoretically meet the must-link constraints under mild assumptions. Furthermore, we verify the effectiveness and efficiency of our proposed method through comprehensive numerical experiments.
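
As a small illustration of what "meeting the must-link constraints" means for an output clustering, the sketch below computes the fraction of must-link pairs whose two endpoints land in the same cluster. It is not taken from the paper: the function name `must_link_ratio`, the integer label encoding, and the toy data are all assumptions made for this example.

```python
from typing import Iterable, Sequence, Tuple


def must_link_ratio(
    labels: Sequence[int],
    must_links: Iterable[Tuple[int, int]],
) -> float:
    """Fraction of must-link pairs (i, j) whose endpoints share a label.

    `labels[i]` is the cluster index assigned to sample i (hypothetical
    encoding); `must_links` lists index pairs that must be co-clustered.
    """
    pairs = list(must_links)
    if not pairs:
        # With no constraints, every constraint is trivially satisfied.
        return 1.0
    satisfied = sum(1 for i, j in pairs if labels[i] == labels[j])
    return satisfied / len(pairs)


# Toy data (made up for illustration): 5 samples, 3 must-link pairs.
labels = [0, 0, 1, 1, 2]
must_links = [(0, 1), (2, 3), (3, 4)]
ratio = must_link_ratio(labels, must_links)
print(f"must-link constraints satisfied: {ratio:.3f}")
```

A ratio of 1 means that every must-link pair is respected; in the toy data above the pair (3, 4) is split across clusters, so the ratio falls below 1.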

Paper Structure

This paper contains 24 sections, 9 theorems, 45 equations, 8 figures, 3 tables, 1 algorithm.

Key Result

Lemma 2.1

Let $\bar{A}$ be defined as in eq:Aconstruct. We then have $\mathcal{S}^n_{A}=\mathcal{S}^n_{\bar{A}}$ and

Figures (8)

  • Figure 4.1: Synthetic graph datasets
  • Figure 4.2: A comparison of ACC of COSSC with different $\beta$ and $d\geq k^*$, where several lines overlap at ACC$=1$.
  • Figure 4.3: A comparison of the output clusters of COSSC with different $p$.
  • Figure 4.4: Ratio of the must-link constraints satisfied by output clusters on Figure \ref{pic:8graph} with different $\beta$ and $p$.
  • Figure 4.5: Comparisons between COSSC and SCA with different input numbers of clusters.
  • ...and 3 more figures

Theorems & Definitions (21)

  • Lemma 2.1
  • proof
  • Theorem 2.2
  • proof
  • Theorem 2.3
  • Remark 2.1
  • Theorem 2.4
  • Remark 2.2
  • Theorem 2.5
  • proof
  • ...and 11 more