LSEC: Large-scale spectral ensemble clustering

Hongmin Li, Xiucai Ye, Akira Imakura, Tetsuya Sakurai

TL;DR

This paper tackles the efficiency challenges of large-scale ensemble clustering by introducing LSEC, which combines a divide-and-conquer large-scale spectral clustering approach for generating diverse base clusterings with a bipartite-graph based consensus function. It introduces two acceleration tricks—reusing $K$-nearest neighbors and light-$k$-means—to drastically reduce computation without sacrificing accuracy. The method yields a lower overall complexity than many existing ensemble approaches and demonstrates strong performance on ten large-scale datasets in terms of ACC and NMI, while also achieving faster runtimes. The work offers a practical, scalable framework for ensemble clustering capable of handling datasets with millions of points, with broad implications for applications requiring robust consensus clustering at scale.

Abstract

Ensemble clustering is a fundamental problem in the machine learning field, combining multiple base clusterings into a better clustering result. However, most existing methods are unsuitable for large-scale ensemble clustering tasks because of their efficiency bottleneck. In this paper, we propose a large-scale spectral ensemble clustering (LSEC) method to strike a good balance between efficiency and effectiveness. In LSEC, an efficient ensemble generation framework based on large-scale spectral clustering is designed to generate diverse base clusterings with low computational complexity. All base clusterings are then combined into a better consensus clustering result through a consensus function based on bipartite graph partitioning. The LSEC method achieves a lower computational complexity than most existing ensemble clustering methods. Experiments conducted on ten large-scale datasets show the efficiency and effectiveness of the LSEC method. The MATLAB code of the proposed method and the experimental datasets are available at https://github.com/Li-Hongmin/MyPaperWithCode.
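
To make the consensus step more concrete, the sketch below builds the object-to-cluster incidence matrix that a bipartite-graph consensus function would typically partition. This is an illustration under assumptions, not the authors' MATLAB implementation: the function name `build_object_cluster_graph` and the toy labels are hypothetical, and the abstract does not spell out the exact graph construction, so the common "one node per object, one node per base cluster" layout is assumed here. The partitioning itself (for example, a spectral cut of this graph) is omitted.

```python
from typing import Dict, List, Tuple


def build_object_cluster_graph(base_clusterings: List[List[int]]) -> List[List[int]]:
    """Return the n x c binary incidence matrix of an object-cluster bipartite graph.

    Each row is an object; each column is one cluster taken from one base
    clustering. Entry (i, j) is 1 when object i belongs to cluster j.
    (Hypothetical helper; the graph layout is an assumption.)
    """
    if not base_clusterings:
        raise ValueError("need at least one base clustering")
    n = len(base_clusterings[0])
    if any(len(labels) != n for labels in base_clusterings):
        raise ValueError("all base clusterings must label the same n objects")

    # Give every (clustering index, label) pair its own global column.
    column_of: Dict[Tuple[int, int], int] = {}
    for t, labels in enumerate(base_clusterings):
        for label in sorted(set(labels)):
            column_of[(t, label)] = len(column_of)

    incidence = [[0] * len(column_of) for _ in range(n)]
    for t, labels in enumerate(base_clusterings):
        for i, label in enumerate(labels):
            incidence[i][column_of[(t, label)]] = 1
    return incidence


if __name__ == "__main__":
    # Three made-up base clusterings of five objects.
    ensemble = [
        [0, 0, 1, 1, 1],
        [0, 0, 0, 1, 1],
        [1, 1, 0, 0, 2],
    ]
    for row in build_object_cluster_graph(ensemble):
        print(row)
```

Each row contains exactly one 1 per base clustering, so two objects that are often grouped together share many columns; that overlap is the signal a graph-partitioning consensus function can exploit.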

Paper Structure

This paper contains 22 sections, 21 equations, 2 figures, 9 tables, 2 algorithms.

Figures (2)

  • Figure 1: An overview of the proposed method. Given a dataset, $\frac{m}{q}$ sets of landmarks are first generated. A set of $K$-nearest neighbors is then found for each $R^{(i)}$, and $m$ sparse similarity matrices are constructed. Finally, the base clusterings are obtained through a bipartite graph partitioning process. The proposed method accelerates the similarity matrix construction by recycling $K$-nearest neighbors, and the bipartite graph partitioning by applying light-$k$-means.
  • Figure 2: Illustration of the five synthetic datasets. Note that only $0.1\%$ of the samples of each dataset are plotted.
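
The similarity-construction step described in the Figure 1 caption can be sketched as follows: each data point keeps similarities only to its $K$ nearest landmarks, which yields a sparse point-to-landmark matrix. This is a minimal sketch under stated assumptions rather than the paper's procedure: the function name `knn_landmark_similarity`, the brute-force neighbor search, the Gaussian kernel, and the bandwidth (mean retained distance) are all choices made here for illustration, and the neighbor-recycling and light-$k$-means accelerations are not shown.

```python
import math
from typing import Dict, List, Sequence


def knn_landmark_similarity(
    points: Sequence[Sequence[float]],
    landmarks: Sequence[Sequence[float]],
    K: int,
) -> List[Dict[int, float]]:
    """Sparse point-to-landmark similarities, keeping K nearest landmarks per point.

    Row i is returned as {landmark index: similarity}. The Gaussian kernel and
    its bandwidth are assumptions for illustration.
    """
    if not 1 <= K <= len(landmarks):
        raise ValueError("K must be between 1 and the number of landmarks")

    # Brute-force search: distance from every point to every landmark,
    # then keep the K smallest (ties broken by landmark index).
    nearest = []
    for x in points:
        dists = sorted((math.dist(x, r), j) for j, r in enumerate(landmarks))
        nearest.append(dists[:K])

    # Bandwidth: mean of the retained distances (assumed heuristic).
    sigma = sum(d for row in nearest for d, _ in row) / (len(points) * K)
    sigma = sigma or 1.0  # guard against all-zero distances

    return [
        {j: math.exp(-(d * d) / (2.0 * sigma * sigma)) for d, j in row}
        for row in nearest
    ]


if __name__ == "__main__":
    # Made-up 2-D points and landmarks.
    pts = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0)]
    lms = [(0.0, 0.0), (5.0, 5.0), (20.0, 20.0)]
    for row in knn_landmark_similarity(pts, lms, K=2):
        print(row)
```

Storing each row as a small dictionary keeps memory proportional to the number of points times $K$ rather than the number of points times the number of landmarks, which is the reason such sparse constructions scale to large datasets.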