Scalable Co-Clustering for Large-Scale Data through Dynamic Partitioning and Hierarchical Merging

Zihan Wu, Zhaoke Huang, Hong Yan

TL;DR

This work tackles the scalability bottlenecks of co-clustering on large, high-dimensional datasets by introducing the Large-scale Adaptive Matrix Co-clustering (LAMC) framework. LAMC partitions a large data matrix into submatrices whose configuration is chosen by a probabilistic partitioning model, co-clusters each submatrix with a local method (including a graph-based spectral approach tied to SVD), and then hierarchically merges the local results into a global co-clustering. The framework gives theoretical bounds on partitioning reliability in terms of the per-block failure probability $P(\omega_k)$ and the overall detection probability $P$. It reports reductions in computation time of roughly 83% for dense matrices and up to 30% for sparse matrices, together with improved clustering quality on diverse large-scale datasets. Because the submatrices can be processed in parallel and the merging step is robust to heterogeneity, the method is positioned as a practical tool for large-scale bioinformatics, text, and other high-dimensional data analyses.

Abstract

Co-clustering simultaneously clusters rows and columns, revealing more fine-grained groups. However, existing co-clustering methods suffer from poor scalability and cannot handle large-scale data. This paper presents a novel and scalable co-clustering method designed to uncover intricate patterns in high-dimensional, large-scale datasets. Specifically, we first propose a large matrix partitioning algorithm that partitions a large matrix into smaller submatrices, enabling parallel co-clustering. This method employs a probabilistic model to optimize the configuration of submatrices, balancing the computational efficiency and depth of analysis. Additionally, we propose a hierarchical co-cluster merging algorithm that efficiently identifies and merges co-clusters from these submatrices, enhancing the robustness and reliability of the process. Extensive evaluations validate the effectiveness and efficiency of our method. Experimental results demonstrate a significant reduction in computation time, with an approximate 83% decrease for dense matrices and up to 30% for sparse matrices.
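The partitioning step described in the abstract can be illustrated with a minimal sketch. It assumes a uniform random shuffle of row and column indices followed by a near-equal split into an $m \times n$ grid of blocks; the function names `split_evenly` and `random_partition` are hypothetical, and the paper's probabilistic model for choosing the block configuration is not reproduced here.

```python
import random


def split_evenly(indices, parts):
    """Split a list into `parts` contiguous chunks whose sizes differ by at most 1."""
    base, extra = divmod(len(indices), parts)
    chunks, start = [], 0
    for i in range(parts):
        size = base + (1 if i < extra else 0)
        chunks.append(indices[start:start + size])
        start += size
    return chunks


def random_partition(n_rows, n_cols, m, n, seed=None):
    """Return m*n blocks, each a (row_indices, col_indices) pair.

    Assumption: rows and columns are shuffled uniformly at random, so every
    cell of the matrix lands in exactly one block.
    """
    rng = random.Random(seed)
    rows = list(range(n_rows))
    cols = list(range(n_cols))
    rng.shuffle(rows)
    rng.shuffle(cols)
    row_groups = split_evenly(rows, m)
    col_groups = split_evenly(cols, n)
    return [(r, c) for r in row_groups for c in col_groups]


# Partition a 10 x 7 matrix into a 3 x 2 grid of submatrices.
blocks = random_partition(n_rows=10, n_cols=7, m=3, n=2, seed=0)
```

Each block could then be handed to an independent worker for local co-clustering, which is what makes the scheme parallelizable; repeating the partition with different seeds gives the repeated random samplings that the reliability analysis refers to.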

Paper Structure

This paper contains 28 sections, 1 theorem, 18 equations, 2 figures, 3 tables, 1 algorithm.

Key Result

Theorem 1

If the matrix $\mathbf{A}$ is partitioned into $m \times n$ blocks, each of size $\phi_i \times \psi_j$, and the probability of failing to detect co-cluster $\mathbf{C}_k$ in any block is $P(\omega_k)$, then, given $T_p$ rounds of random sampling, the probability of detecting the co-cluster $\mathbf{C}_k$ is
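The closing expression of the theorem is not shown in this summary. As a hedged illustration only: if the $T_p$ sampling rounds are assumed independent and each misses $\mathbf{C}_k$ with probability $P(\omega_k)$, the detection probability would be $1 - P(\omega_k)^{T_p}$. The sketch below evaluates that assumed expression and the smallest $T_p$ reaching a target reliability; the function names are hypothetical and the formula is an assumption, not a quotation from the paper.

```python
import math


def detection_probability(p_fail, t_p):
    """Assumed form: 1 - p_fail ** t_p, for t_p independent sampling rounds."""
    if not 0.0 <= p_fail <= 1.0:
        raise ValueError("p_fail must lie in [0, 1]")
    if t_p < 1:
        raise ValueError("t_p must be a positive integer")
    return 1.0 - p_fail ** t_p


def min_sampling_rounds(p_fail, target):
    """Smallest t_p with 1 - p_fail ** t_p >= target (same independence assumption)."""
    if not 0.0 < p_fail < 1.0:
        raise ValueError("p_fail must lie strictly between 0 and 1")
    if not 0.0 < target < 1.0:
        raise ValueError("target must lie strictly between 0 and 1")
    # Solve p_fail ** t <= 1 - target for t, then round up to an integer.
    return max(1, math.ceil(math.log(1.0 - target) / math.log(p_fail)))


print(detection_probability(0.5, 3))   # 0.875
print(min_sampling_rounds(0.5, 0.95))  # 5
```

Under this assumption the miss probability decays geometrically in $T_p$, so even a fairly unreliable single partition can reach a high overall detection probability after a handful of repetitions.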

Figures (2)

  • Figure 1: An illustration of the differences between (a) Clustering and (b) Co-clustering [yan2017CoclusteringMultidimensionalBig].
  • Figure 2: Workflow of our proposed Large-scale Adaptive Matrix Co-clustering for large matrices.

Theorems & Definitions (2)

  • Theorem 1
  • proof