Scalable Co-Clustering for Large-Scale Data through Dynamic Partitioning and Hierarchical Merging
Zihan Wu, Zhaoke Huang, Hong Yan
TL;DR
This work tackles the scalability bottlenecks of co-clustering in large, high-dimensional datasets by introducing the Large-scale Adaptive Matrix Co-clustering (LAMC) framework. It partitions a large data matrix into optimally configured submatrices using a probabilistic partitioning model, co-clusters each submatrix with local methods (including a graph-based spectral approach tied to SVD), and then hierarchically merges the results into a robust global co-clustering. The framework provides theoretical bounds on partitioning reliability via the probabilities $P(\omega_k)$ and $P$, and reduces computation time substantially (by approximately 83% for dense matrices and by up to 30% for sparse matrices) while improving clustering quality on diverse large-scale datasets. The method’s parallelizable design and robustness to heterogeneity position it as a practical tool for large-scale bioinformatics, text, and other high-dimensional data analyses.
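To make the "spectral approach tied to SVD" concrete, the following is a minimal sketch of a standard bipartite spectral co-clustering step on a single submatrix. It is not taken from the paper: the function name `spectral_bipartition` is hypothetical, the split into exactly two co-clusters by sign is a simplification (k-means on the embedding is the more common choice), and the input is assumed to be non-negative with no all-zero rows or columns.

```python
import numpy as np

def spectral_bipartition(A):
    """Split the rows and columns of a non-negative matrix into two co-clusters.

    Illustrative sketch only; assumes every row and column sum is positive.
    """
    A = np.asarray(A, dtype=float)
    # Degree scaling: D1^{-1/2} for rows and D2^{-1/2} for columns.
    r = 1.0 / np.sqrt(A.sum(axis=1))
    c = 1.0 / np.sqrt(A.sum(axis=0))
    A_n = r[:, None] * A * c[None, :]
    # The second singular vector pair carries the two-way partition.
    U, _, Vt = np.linalg.svd(A_n, full_matrices=False)
    z_rows = r * U[:, 1]
    z_cols = c * Vt[1, :]
    # Rows and columns with the same sign land in the same co-cluster.
    return (z_rows > 0).astype(int), (z_cols > 0).astype(int)

# Toy submatrix: two dense blocks with weak off-block entries.
A = np.full((6, 4), 0.1)
A[:3, :2] = 1.0
A[3:, 2:] = 1.0
row_labels, col_labels = spectral_bipartition(A)
```

Because the left and right singular vectors come as a pair, a row and a column that receive the same label belong to the same block, which is what makes the result a co-clustering rather than two independent clusterings.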
Abstract
Co-clustering simultaneously clusters rows and columns, revealing more fine-grained groups. However, existing co-clustering methods scale poorly and cannot handle large-scale data. This paper presents a novel and scalable co-clustering method designed to uncover intricate patterns in high-dimensional, large-scale datasets. Specifically, we first propose a large matrix partitioning algorithm that partitions a large matrix into smaller submatrices, enabling parallel co-clustering. This algorithm employs a probabilistic model to optimize the configuration of submatrices, balancing computational efficiency against depth of analysis. Additionally, we propose a hierarchical co-cluster merging algorithm that efficiently identifies and merges co-clusters from these submatrices, enhancing the robustness and reliability of the process. Extensive evaluations validate the effectiveness and efficiency of our method. Experimental results demonstrate a significant reduction in computation time: approximately 83% for dense matrices and up to 30% for sparse matrices.
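The partition-then-merge pipeline described above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the authors' algorithm: the helper names, the fixed block sizes, the number of sampling rounds, and the Jaccard-overlap merge rule are all hypothetical stand-ins for the probabilistic configuration model and the hierarchical merging criterion in the paper.

```python
import random

def partition_indices(n, block_size, rng):
    """Shuffle 0..n-1 and cut the result into blocks of at most block_size."""
    idx = list(range(n))
    rng.shuffle(idx)
    return [sorted(idx[i:i + block_size]) for i in range(0, n, block_size)]

def partition_matrix(n_rows, n_cols, row_block, col_block, n_rounds, seed=0):
    """Yield (rows, cols) index lists for every submatrix over several rounds.

    Each round is an independent random partition; the submatrices within a
    round are disjoint and could be co-clustered in parallel.
    """
    rng = random.Random(seed)
    for _ in range(n_rounds):
        row_parts = partition_indices(n_rows, row_block, rng)
        col_parts = partition_indices(n_cols, col_block, rng)
        for rows in row_parts:
            for cols in col_parts:
                yield rows, cols

def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 when both are empty)."""
    return len(a & b) / len(a | b) if (a or b) else 0.0

def merge_coclusters(coclusters, threshold=0.5):
    """Greedily union co-clusters whose row AND column sets overlap enough.

    Assumed merge rule for illustration; the paper's criterion may differ.
    """
    clusters = [(set(r), set(c)) for r, c in coclusters]
    merged = True
    while merged:
        merged = False
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                (ri, ci), (rj, cj) = clusters[i], clusters[j]
                if jaccard(ri, rj) >= threshold and jaccard(ci, cj) >= threshold:
                    clusters[i] = (ri | rj, ci | cj)
                    del clusters[j]
                    merged = True
                    break
            if merged:
                break
    return clusters

# A 6 x 4 matrix cut into 3 x 2 submatrices, sampled over two rounds.
parts = list(partition_matrix(6, 4, row_block=3, col_block=2, n_rounds=2))

# Hand-made local co-clusters, as if found in different submatrices.
fragments = [([0, 1, 2], [0, 1]), ([1, 2, 3], [0, 1]), ([7, 8], [5, 6])]
merged = merge_coclusters(fragments)
```

Repeating the random partition over several rounds is what gives fragments from different submatrices a chance to overlap, so that a co-cluster split by one partition can be recovered when the pieces are merged.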
