A Deterministic Information Bottleneck Method for Clustering Mixed-Type Data
Efthymios Costa, Ioanna Papatsouma, Angelos Markos
TL;DR
Clustering mixed-type data is challenging because the features are of heterogeneous types. The paper introduces DIBmix, a deterministic information bottleneck method that uses a generalised product kernel to handle continuous, nominal, and ordinal variables jointly. It trades off the compression term $H(T)$ against the relevance term $I(Y;T)$ through a KL-divergence-based update of the cluster assignments. The paper provides a systematic strategy for choosing the bandwidths and the regularisation parameter that keeps exactly $C$ clusters and balances contributions across feature types. DIBmix outperforms four established methods on 28,800 synthetic datasets and ten UCI benchmarks, particularly under imbalanced clusters and moderate overlap. The work delivers a theoretically grounded, practical clustering framework for heterogeneous data, with an R package (IBclust) and publicly available code. It also discusses limitations (bandwidth sensitivity, cluster-homogeneity assumptions, and scalability) and avenues for future work such as adaptive bandwidths and knee-based estimation of the number of clusters.
Abstract
In this paper, we present an information-theoretic method for clustering mixed-type data, that is, data consisting of both continuous and categorical variables. The proposed approach extends the Information Bottleneck principle to heterogeneous data through generalised product kernels, integrating continuous, nominal, and ordinal variables within a unified optimisation framework. We address two challenges: developing a systematic bandwidth selection strategy that equalises contributions across variable types, and proposing an adaptive hyperparameter updating scheme that ensures a valid partition into a predetermined number of potentially imbalanced clusters. Through simulations on 28,800 synthetic data sets and ten publicly available benchmarks, we demonstrate that the proposed method, named DIBmix, outperforms four established methods (KAMILA, K-Prototypes, FAMD with K-Means, and PAM with Gower's dissimilarity). The results show that DIBmix particularly excels when clusters are imbalanced in size, cluster overlap is low or moderate, and categorical and continuous variables are equally represented. The method offers a significant advantage over traditional centroid-based algorithms, establishing DIBmix as a competitive and theoretically grounded alternative for mixed-type data clustering.
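To make the mechanics concrete, the following is a minimal Python sketch of a deterministic information bottleneck loop on a mixed-type product kernel. It is an illustration under stated assumptions, not the authors' implementation (that is the R package IBclust). The standard deterministic IB objective minimises $H(T) - \beta I(Y;T)$, and each observation $x$ is reassigned to the cluster $t$ that maximises $\log q(t) - \beta\, \mathrm{KL}(p(y|x) \,\|\, q(y|t))$. The function names `product_kernel` and `dib_cluster`, the Gaussian and Aitchison-Aitken-style kernel choices, the fixed bandwidths `s` and `lam`, the fixed `beta`, and the uniform $p(x)$ are all assumptions made for illustration. The sketch omits ordinal variables and contains none of the paper's bandwidth selection or adaptive hyperparameter updating, so it can return fewer than the requested number of clusters.

```python
import numpy as np

def product_kernel(X_cont, X_nom, s=1.0, lam=0.2):
    """Pairwise product kernel over mixed columns (kernel choices are assumptions):
    Gaussian on continuous columns, Aitchison-Aitken-style on nominal columns."""
    n = X_cont.shape[0]
    K = np.ones((n, n))
    for j in range(X_cont.shape[1]):
        d = X_cont[:, j][:, None] - X_cont[:, j][None, :]
        K *= np.exp(-0.5 * (d / s) ** 2)
    for j in range(X_nom.shape[1]):
        col = X_nom[:, j]
        levels = len(np.unique(col))
        same = col[:, None] == col[None, :]
        K *= np.where(same, 1.0 - lam, lam / max(levels - 1, 1))
    return K

def dib_cluster(K, n_clusters, beta=5.0, n_iter=50, seed=0):
    """Hard-assignment DIB iterations with uniform p(x) and a fixed beta."""
    rng = np.random.default_rng(seed)
    n = K.shape[0]
    eps = 1e-12
    # p(y|x): each kernel row normalised into a distribution over observations y
    p_y_x = K / K.sum(axis=1, keepdims=True)
    labels = rng.integers(n_clusters, size=n)  # random initial assignment
    for _ in range(n_iter):
        q_t = np.zeros(n_clusters)         # cluster masses q(t)
        q_y_t = np.zeros((n_clusters, n))  # cluster-conditional distributions q(y|t)
        for t in range(n_clusters):
            mask = labels == t
            if mask.any():
                q_t[t] = mask.mean()
                q_y_t[t] = p_y_x[mask].mean(axis=0)
        # KL(p(y|x) || q(y|t)) for every (x, t) pair, shape (n, n_clusters)
        log_ratio = np.log(p_y_x[:, None, :] + eps) - np.log(q_y_t[None, :, :] + eps)
        kl = (p_y_x[:, None, :] * log_ratio).sum(axis=2)
        # DIB update: argmax over t of log q(t) - beta * KL
        score = np.log(q_t + eps)[None, :] - beta * kl
        new_labels = score.argmax(axis=1)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels

# Toy usage: two continuous blobs plus one nominal column aligned with them.
rng = np.random.default_rng(1)
X_cont = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(8, 1, (20, 2))])
X_nom = np.repeat([0, 1], 20)[:, None]
K = product_kernel(X_cont, X_nom)
labels = dib_cluster(K, n_clusters=2)
```

In this sketch an empty cluster stays empty, because its $\log q(t)$ term is strongly negative. That is the collapse behaviour that motivates a parameter strategy guaranteeing exactly $C$ clusters.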
