Similarity and Dissimilarity Guided Co-association Matrix Construction for Ensemble Clustering

Xu Zhang, Yuheng Jia, Mofei Song, Ran Wang

TL;DR

The Similarity and Dissimilarity Guided Co-association matrix (SDGCA) is proposed for ensemble clustering. It introduces normalized ensemble entropy to estimate the quality of each cluster and constructs a similarity matrix based on this estimation. The adversarial relationship between the similarity matrix and the dissimilarity matrix is then utilized to construct a promoted CA matrix.

Abstract

Ensemble clustering aggregates multiple weak clusterings to achieve a more accurate and robust consensus result. The Co-Association matrix (CA matrix) based method is the mainstream ensemble clustering approach; it constructs the similarity relationships between sample pairs according to the weak clustering partitions to generate the final clustering result. However, existing methods neglect that the quality of a cluster is related to its size, i.e., a smaller cluster tends to have higher accuracy. Moreover, they do not consider the valuable dissimilarity information in the base clusterings, which can reflect the varying importance of sample pairs that are completely disconnected. To this end, we propose the Similarity and Dissimilarity Guided Co-association matrix (SDGCA) to achieve ensemble clustering. First, we introduce normalized ensemble entropy to estimate the quality of each cluster, and construct a similarity matrix based on this estimation. Then, we employ a random walk to explore the high-order proximity of base clusterings to construct a dissimilarity matrix. Finally, the adversarial relationship between the similarity matrix and the dissimilarity matrix is utilized to construct a promoted CA matrix for ensemble clustering. We compared our method with 13 state-of-the-art methods across 12 datasets, and the results demonstrated the superior clustering ability and robustness of the proposed approach. The code is available at https://github.com/xuz2019/SDGCA.
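
For readers unfamiliar with the baseline that SDGCA builds on, the following is a minimal sketch of a plain (unweighted) CA matrix: entry (i, j) is the fraction of base clusterings that place samples i and j in the same cluster. It is not the SDGCA construction itself; the function name `co_association` and the two six-sample partitions `pi1` and `pi2` are hypothetical and chosen only for illustration.

```python
import numpy as np


def co_association(base_labels):
    """Plain co-association matrix (illustrative sketch, not SDGCA).

    base_labels: array-like of shape (M, n), one row of cluster labels
    per base clustering, over the same n samples.
    Returns an (n, n) matrix whose (i, j) entry is the fraction of base
    clusterings in which samples i and j share a cluster.
    """
    base_labels = np.asarray(base_labels)
    m, n = base_labels.shape
    ca = np.zeros((n, n))
    for labels in base_labels:
        # Boolean adjacency matrix of one base clustering via broadcasting.
        ca += labels[:, None] == labels[None, :]
    return ca / m


# Hypothetical base clusterings of six samples (assumed for illustration).
pi1 = [0, 0, 0, 1, 1, 1]
pi2 = [0, 0, 1, 1, 2, 2]
ca = co_association([pi1, pi2])
print(ca)
```

In this toy input, samples 0 and 3 are never co-clustered, so their CA entry is zero. Pairs of this completely disconnected kind are the ones for which the abstract argues that dissimilarity information from the base clusterings is valuable.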

Paper Structure

This paper contains 25 sections, 46 equations, 9 figures, 8 tables, 1 algorithm.

Figures (9)

  • Figure 1: The relationship between the size of clusters and their precision in the SPF dataset. The vertical axis represents their mean and median precision. For example, the first violin plot illustrates that clusters with sizes ranging from 0 to 100 have a mean precision and a median precision of 0.65 and 0.64, respectively. It can be observed that smaller cluster sizes generally imply higher precision in both mean and median values.
  • Figure 2: $\pi^1$ and $\pi^2$ represent two base clusterings, dividing six samples, with the corresponding adjacency matrices on the right. $\pi^*$ is the ensemble of $\pi^1$ and $\pi^2$, with the CA, LWCA and NWCA matrices on the right. The far-right side represents the dissimilarity between samples. It is evident in the ensemble that $\{x_1, x_4\}$ are not connected in the same cluster, but intuitively, they should belong to one cluster.
  • Figure 3: Clustering performances with respect to NMI with varying $\eta$ and $\theta$.
  • Figure 4: Comparison of NWCA and LWCA in terms of NMI. The horizontal axis represents the hyper-parameter $\lambda$.
  • Figure 5: Clustering performances of different algorithms by varying ensemble size w.r.t. NMI.
  • ...and 4 more figures