RelDenClu: A Relative Density based Biclustering Method for identifying non-linear feature relations
Namita Jain, Susmita Ghosh, C. A. Murthy
TL;DR
RelDenClu is a biclustering framework for non-linear feature relations. For each pair of features, it compares the joint density to the marginal densities and uses the local variations between them to find the observations that support a dependent relationship. A bicluster is then formed as a connected set of observations and features that share such relationships. The resulting biclusters are robust to scaling and translation of the features, and the method handles variable marginal densities. Density estimates are adapted automatically for each feature pair, which avoids fragmenting biclusters. RelDenClu produces better results than seven state-of-the-art methods on fifteen types of simulated datasets and six real-life datasets, and it is also applied to a COVID-19 dataset to identify features likely to affect the spread of the disease. The approach supports both unsupervised discovery and, as an aid, supervised learning.
Abstract
Existing biclustering algorithms for finding biclusters based on feature relations often depend on assumptions such as monotonicity or linearity. Although a few algorithms overcome this problem by using density-based methods, they tend to miss many biclusters because they use global criteria for identifying dense regions. The proposed method, RelDenClu, uses the local variations in marginal and joint densities for each pair of features to find the subset of observations that forms the basis of the relation between them. It then finds the set of features connected by a common set of observations, resulting in a bicluster. To show the effectiveness of the proposed methodology, experiments have been carried out on fifteen types of simulated datasets. Further, the method has been applied to six real-life datasets. For three of these real-life datasets, the proposed method is used for unsupervised learning, while for the other three it is used as an aid to supervised learning. For all the datasets, the performance of the proposed method is compared with that of seven different state-of-the-art algorithms, and the proposed algorithm is seen to produce better results. The efficacy of the proposed algorithm is also seen in its use on a COVID-19 dataset for identifying some features (genetic, demographic, and others) that are likely to affect the spread of COVID-19.
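To make the pairwise step concrete, the following is a minimal sketch of comparing a joint density to the product of the marginal densities for one feature pair. It is an illustration under stated assumptions, not the authors' estimator: the equal-width histogram, the function name `relative_density_mask`, and the parameters `bins` and `min_ratio` are all hypothetical choices made here. The second step, connecting features through a common set of observations, is not covered.

```python
import numpy as np


def relative_density_mask(x, y, bins=8, min_ratio=1.5):
    """Flag observations in cells where the joint density of (x, y) exceeds
    the product of the marginal densities by at least `min_ratio`.

    Assumption: densities are estimated with an equal-width 2-D histogram;
    `bins` and `min_ratio` are illustrative parameters, not from the paper.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)

    # Joint probability per cell, and the marginals obtained by summing it.
    joint, x_edges, y_edges = np.histogram2d(x, y, bins=bins)
    joint /= joint.sum()
    p_x = joint.sum(axis=1)
    p_y = joint.sum(axis=0)

    # Under independence the joint would equal the outer product of marginals.
    expected = np.outer(p_x, p_y)
    ratio = np.divide(joint, expected,
                      out=np.zeros_like(joint), where=expected > 0)

    # Map every observation back to its histogram cell.
    ix = np.clip(np.searchsorted(x_edges, x, side="right") - 1, 0, bins - 1)
    iy = np.clip(np.searchsorted(y_edges, y, side="right") - 1, 0, bins - 1)
    return ratio[ix, iy] > min_ratio


# Synthetic example: the first 500 observations follow y = x^2 (a non-linear,
# non-monotonic relation); the remaining 500 have y independent of x.
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=1000)
y = np.concatenate([
    x[:500] ** 2 + rng.normal(0.0, 0.01, size=500),
    rng.uniform(0.0, 1.0, size=500),
])
mask = relative_density_mask(x, y)
print("flagged among related observations:  ", mask[:500].mean())
print("flagged among unrelated observations:", mask[500:].mean())
```

Because the marginals enter the ratio, a cell counts as dense only relative to what the two features would produce independently, which is the sense in which the criterion is local rather than global. The histogram is one simple way to realise that comparison; other density estimates would serve the same purpose.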
