HBIC: A Biclustering Algorithm for Heterogeneous Datasets

Adán José-García, Julie Jacques, Clément Chauvet, Vincent Sobanski, Clarisse Dhaenens

TL;DR

HBIC addresses the challenge of biclustering heterogeneous datasets containing numeric, binary, and categorical attributes by discretizing numeric features and greedily constructing candidate biclusters, then selecting the most representative biclusters via the heterogeneous intra-bicluster variance (HIV). HIV combines numeric variance and discrete-value homogeneity and is defined as $HIV(B) = ANV(I,J_{num}) + ACF(I,J_{cat})$, with $ANV(I,J_{num}) = \frac{1}{|J_{num}|}\sum_{j\in J_{num}} \frac{var(b_{Ij})}{var(x_{Rj})}$ and $ACF(I,J_{cat}) = \frac{1}{|J_{cat}|} \sum_{j\in J_{cat}} \left(1 - \frac{freq(b_{Ij})}{|I|}\right)$. The method automatically determines the number of biclusters. On 315 heterogeneous synthetic datasets and a systemic sclerosis study, it outperforms the standard numeric biclustering baselines CCA and LAS in recovery and relevance and provides richer, clinically interpretable subgroups. The approach yields diverse, multi-type biclusters and is available as open-source code, offering a practical tool for exploring complex heterogeneous data in biomedicine. Overall, HBIC advances heterogeneous biclustering by avoiding heavy feature transformations and handling mixed data types simultaneously.
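
The HIV definition can be illustrated with a minimal Python sketch. It is not the authors' implementation, and it rests on several assumptions that the summary does not confirm: $freq(b_{Ij})$ is read as the count of the most frequent value in column $j$ of the bicluster, $var$ is the population variance, $x_{Rj}$ is column $j$ over all rows of the data matrix, every numeric column has non-zero variance, and an empty column group contributes 0. The function names `anv`, `acf`, and `hiv` are hypothetical.

```python
import numpy as np


def anv(x_num, rows):
    """Average normalized variance over the numeric columns (assumed reading of ANV)."""
    if x_num.shape[1] == 0:
        return 0.0  # assumption: no numeric columns contributes nothing
    sub = x_num[rows, :]
    # Variance inside the bicluster relative to the variance of the full column.
    return float(np.mean(sub.var(axis=0) / x_num.var(axis=0)))


def acf(x_cat, rows):
    """Average share of values that differ from the most frequent one (assumed reading of ACF)."""
    if x_cat.shape[1] == 0:
        return 0.0  # assumption: no discrete columns contributes nothing
    sub = x_cat[rows, :]
    scores = []
    for j in range(sub.shape[1]):
        _, counts = np.unique(sub[:, j], return_counts=True)
        scores.append(1.0 - counts.max() / len(rows))
    return float(np.mean(scores))


def hiv(x_num, x_cat, rows):
    """HIV(B) = ANV + ACF for the bicluster formed by `rows` and all given columns."""
    return anv(x_num, rows) + acf(x_cat, rows)


# Toy data: one numeric and one categorical column, four rows.
x_num = np.array([[1.0], [1.0], [3.0], [5.0]])
x_cat = np.array([["a"], ["a"], ["b"], ["c"]])

homogeneous = hiv(x_num, x_cat, [0, 1])      # identical values in both columns
whole_matrix = hiv(x_num, x_cat, [0, 1, 2, 3])
```

Under these assumptions a lower HIV means a more homogeneous bicluster: the two identical rows score 0, while the whole toy matrix scores 1.5 (ANV of 1 plus ACF of 0.5).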

Abstract

Biclustering is an unsupervised machine-learning approach aiming to cluster rows and columns simultaneously in a data matrix. Several biclustering algorithms have been proposed for handling numeric datasets. However, real-world data mining problems often involve heterogeneous datasets with mixed attributes. To address this challenge, we introduce a biclustering approach called HBIC, capable of discovering meaningful biclusters in complex heterogeneous data, including numeric, binary, and categorical data. The approach comprises two stages: bicluster generation and bicluster model selection. In the initial stage, several candidate biclusters are generated iteratively by adding and removing rows and columns based on the frequency of values in the original matrix. In the second stage, we introduce two approaches for selecting the most suitable biclusters by considering their size and homogeneity. Through a series of experiments, we investigated the suitability of our approach on a synthetic benchmark and in a biomedical application involving clinical data of systemic sclerosis patients. The evaluation comparing our method to existing approaches demonstrates its ability to discover high-quality biclusters from heterogeneous data. Our biclustering approach is a starting point for heterogeneous bicluster discovery, leading to a better understanding of complex underlying data structures.

Paper Structure

This paper contains 21 sections, 10 equations, 5 figures, 4 tables, 2 algorithms.

Figures (5)

  • Figure 1: Biclustering performance of the reference algorithms CCA and LAS and the HBIC versions on heterogeneous synthetic datasets in terms of (a) Recovery, (b) Relevance, and (c) Biclustering Error. Filled markers $\newmoon$ at the top of each plot indicate the versions with the highest average value, whereas the markers $+$ denote no statistically significant difference from the best-performing algorithm.
  • Figure 2: Performance of SSc metrics (mean ± std) for the 39 biclusters obtained by HBIC. Meanings appear at the bottom of Table \ref{tab:some-bics}.
  • Figure 3: Visualization of a numerical matrix with five biclusters at different discretization levels, $\text{nbins} = \{2,5,10,15\}$. For each data matrix, its heat map (top) and its corresponding frequency distribution (bottom) are shown. The higher the discretization level, the closer the distribution is to that of the original data matrix.
  • Figure 4: Biclustering performance in terms of recovery (left, in red) and relevance (right, in blue) obtained by the HBIC algorithm on the numeric datasets when varying the parameter $\text{nbins}$ over $\text{nbins} = \{2,3,4,5,6,7,8,9,10,15,20\}$. For both metrics, a higher value indicates better performance of the HBIC algorithm for the given $\text{nbins}$ value.
  • Figure 5: Performance in terms of quality (left), time (center), and number of biclusters (right) obtained by the HBIC algorithm on the numeric datasets when varying the parameter $\text{nbins}$ over $\text{nbins} = \{2,3,4,5,6,7,8,9,10,15,20\}$. For all three metrics, a lower value indicates better performance of the HBIC algorithm for the given $\text{nbins}$ value.
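
The captions of Figures 3 to 5 vary the discretization parameter $\text{nbins}$. As a rough illustration of what that parameter controls, the sketch below discretizes a numeric column into `nbins` equal-width bins with NumPy. Equal-width binning is an assumption made here for illustration; the captions do not state which binning scheme HBIC uses, and the function name `discretize_equal_width` is hypothetical.

```python
import numpy as np


def discretize_equal_width(column, nbins):
    """Map each value to a bin label in 0..nbins-1 using equal-width bins (assumed scheme)."""
    column = np.asarray(column, dtype=float)
    edges = np.linspace(column.min(), column.max(), nbins + 1)
    # Use interior edges only, so the minimum maps to 0 and the maximum to nbins - 1.
    return np.digitize(column, edges[1:-1])


values = np.arange(10)  # 0, 1, ..., 9
coarse = discretize_equal_width(values, 2)
fine = discretize_equal_width(values, 5)
```

With `nbins = 2` the ten values fall into two labels, and with `nbins = 5` into five, which mirrors the observation in Figure 3 that a finer discretization stays closer to the original distribution.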