A Deterministic Information Bottleneck Method for Clustering Mixed-Type Data

Efthymios Costa; Ioanna Papatsouma; Angelos Markos

TL;DR

Clustering mixed-type data is challenging because the feature types are heterogeneous. The paper introduces DIBmix, a deterministic information bottleneck method that uses a generalised product kernel to handle continuous, nominal, and ordinal variables jointly. Cluster assignments are updated with a KL-based rule that trades off the cluster entropy $H(T)$ against the relevant information $I(Y;T)$. A systematic strategy for choosing the bandwidths and the regularisation parameter keeps exactly $C$ clusters and balances contributions across feature types. On 28,800 synthetic data sets and ten UCI benchmarks, DIBmix outperforms four established methods, particularly under imbalanced clusters and low to moderate overlap. The work delivers a theoretically grounded, practical clustering framework for heterogeneous data, with an R package (IBclust) and publicly available code. It also discusses limitations (bandwidth sensitivity, cluster-homogeneity assumptions, and scalability) and directions for future work, such as adaptive bandwidths and knee-based estimation of the number of clusters.
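To make the "KL-based update to cluster assignments" concrete, the following is a minimal Python sketch of one hard-assignment step in the generic deterministic information bottleneck formulation, where each point goes to the cluster maximising $\log q(t) - \beta\, D_{\mathrm{KL}}(p(y|x)\,\|\,q(y|t))$. It is an illustration under stated assumptions, not the DIBmix algorithm or the IBclust API: the function name `dib_step`, its arguments, and the toy data are hypothetical.

```python
import numpy as np

def dib_step(p_y_given_x, p_x, labels, n_clusters, beta, eps=1e-12):
    """One hard-assignment update of a generic deterministic IB (hypothetical helper).

    p_y_given_x : (n, m) array, row i is p(y | x_i)
    p_x         : (n,) array of point weights summing to 1
    labels      : (n,) integer array with the current cluster of each point
    """
    m = p_y_given_x.shape[1]
    # Cluster masses q(t) = sum of p(x) over the points currently in cluster t.
    q_t = np.array([p_x[labels == t].sum() for t in range(n_clusters)])
    # Cluster-conditional distributions q(y | t); empty clusters stay at zero.
    q_y_given_t = np.zeros((n_clusters, m))
    for t in range(n_clusters):
        mask = labels == t
        if q_t[t] > 0:
            weighted = p_x[mask][:, None] * p_y_given_x[mask]
            q_y_given_t[t] = weighted.sum(axis=0) / q_t[t]
    # KL(p(y|x) || q(y|t)) for every (point, cluster) pair; eps avoids log(0).
    log_p = np.log(p_y_given_x + eps)
    log_q = np.log(q_y_given_t + eps)
    kl = (p_y_given_x[:, None, :] * (log_p[:, None, :] - log_q[None, :, :])).sum(axis=2)
    # Assign each point to the cluster with the highest score.
    score = np.log(q_t + eps)[None, :] - beta * kl
    return score.argmax(axis=1)

# Toy example (made-up data): two groups of three points, one point mislabelled.
row_a = [1 / 3, 1 / 3, 1 / 3, 0.0, 0.0, 0.0]
row_b = [0.0, 0.0, 0.0, 1 / 3, 1 / 3, 1 / 3]
p_y_given_x = np.array([row_a, row_a, row_a, row_b, row_b, row_b])
p_x = np.full(6, 1 / 6)
start = np.array([0, 0, 1, 1, 1, 1])
updated = dib_step(p_y_given_x, p_x, start, n_clusters=2, beta=5.0)
```

In this toy case the mislabelled third point moves back to the first cluster after a single step. A full method would iterate such steps to convergence and, as the summary notes, choose the regularisation parameter so that exactly $C$ clusters survive.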

Abstract

In this paper, we present an information-theoretic method for clustering mixed-type data, that is, data consisting of both continuous and categorical variables. The proposed approach extends the Information Bottleneck principle to heterogeneous data through generalised product kernels, integrating continuous, nominal, and ordinal variables within a unified optimisation framework. We address two challenges: developing a systematic bandwidth selection strategy that equalises contributions across variable types, and proposing an adaptive hyperparameter updating scheme that ensures a valid partition into a predetermined number of potentially imbalanced clusters. Through simulations on 28,800 synthetic data sets and ten publicly available benchmarks, we demonstrate that the proposed method, named DIBmix, achieves superior performance compared to four established methods (KAMILA, K-Prototypes, FAMD with K-Means, and PAM with Gower's dissimilarity). Results show that DIBmix particularly excels when clusters are imbalanced in size, when cluster overlap is low or moderate, and when categorical and continuous variables are equally represented. The method presents a significant advantage over traditional centroid-based algorithms, establishing DIBmix as a competitive and theoretically grounded alternative for mixed-type data clustering.
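As an illustration of what a generalised product kernel over mixed-type variables can look like, the sketch below multiplies a Gaussian kernel per continuous variable with an Aitchison-Aitken kernel per nominal variable, then row-normalises the result into conditional distributions. The kernel choices, the shared bandwidths `s` and `lam`, and all function names are assumptions made for this example; the abstract does not state which kernels or bandwidth values DIBmix uses, and ordinal variables are omitted here.

```python
import numpy as np

def gaussian_kernel(a, b, s):
    """Gaussian kernel with bandwidth s for a continuous variable."""
    return np.exp(-0.5 * ((a - b) / s) ** 2) / (s * np.sqrt(2.0 * np.pi))

def aitchison_aitken(a, b, lam, n_levels):
    """Aitchison-Aitken kernel for a nominal variable with n_levels >= 2 levels.

    lam should lie in [0, (n_levels - 1) / n_levels].
    """
    return np.where(a == b, 1.0 - lam, lam / (n_levels - 1))

def product_kernel_matrix(X_cont, X_nom, s, lam, n_levels):
    """Pairwise product kernel over all continuous and nominal columns."""
    n = X_cont.shape[0]
    K = np.ones((n, n))
    for j in range(X_cont.shape[1]):
        col = X_cont[:, j]
        K *= gaussian_kernel(col[:, None], col[None, :], s)
    for j in range(X_nom.shape[1]):
        col = X_nom[:, j]
        K *= aitchison_aitken(col[:, None], col[None, :], lam, n_levels[j])
    return K

def conditional_from_kernel(K):
    """Row-normalise a kernel matrix so that each row sums to one."""
    return K / K.sum(axis=1, keepdims=True)

# Made-up data: one continuous and one binary nominal variable, three points.
X_cont = np.array([[0.0], [0.1], [5.0]])
X_nom = np.array([[0], [0], [1]])
K = product_kernel_matrix(X_cont, X_nom, s=1.0, lam=0.1, n_levels=[2])
P = conditional_from_kernel(K)
```

The first two points are close on the continuous variable and share a nominal level, so their kernel value is much larger than either point's value with the third. Because the product multiplies per-variable kernels, a poorly scaled bandwidth on one variable type can dominate the others, which is the motivation the abstract gives for a bandwidth strategy that equalises contributions.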

Paper Structure

This paper contains 12 sections, 3 theorems, 20 equations, 4 figures, 2 tables, and 1 algorithm.

Introduction
Methodology
Hyperparameter Selection
Bandwidth Selection
Regularisation Parameter Selection
Simulations on Artificial Data
Applications to Publicly Available Data
Conclusion
The discrete uniform is a maximum entropy distribution
Mutual Information increases when a cluster is split
Perturbed similarity matrix entry properties
Results of the knee heuristic simulations
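The appendix title "Mutual Information increases when a cluster is split" can be checked numerically on a small example. The sketch below uses made-up joint distributions of cluster $T$ and relevance variable $Y$ and a hypothetical helper `mutual_information`. It illustrates the general fact that merging clusters cannot increase $I(T;Y)$ (a consequence of the data processing inequality), not the paper's own proof.

```python
import numpy as np

def mutual_information(joint):
    """I(T;Y) in nats from a joint probability table (rows: clusters, cols: y)."""
    joint = np.asarray(joint, dtype=float)
    p_t = joint.sum(axis=1, keepdims=True)
    p_y = joint.sum(axis=0, keepdims=True)
    mask = joint > 0  # zero cells contribute nothing to the sum
    ratio = joint[mask] / (p_t @ p_y)[mask]
    return float((joint[mask] * np.log(ratio)).sum())

# Two clusters (made-up joint distribution).
merged = np.array([[0.30, 0.20],
                   [0.10, 0.40]])

# The first cluster is split into two parts with different y-profiles.
split = np.array([[0.25, 0.05],
                  [0.05, 0.15],
                  [0.10, 0.40]])

# The first cluster is split into two parts with identical y-profiles.
split_proportional = np.array([[0.15, 0.10],
                               [0.15, 0.10],
                               [0.10, 0.40]])

mi_merged = mutual_information(merged)
mi_split = mutual_information(split)
mi_split_proportional = mutual_information(split_proportional)
```

Splitting into parts with different conditional distributions over $Y$ raises the mutual information, while a split into proportional parts leaves it unchanged, so the increase is strict only when the split is informative about $Y$.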

Key Result

Theorem A.1

The discrete uniform distribution with support $\mathcal{S}$ is the maximum entropy distribution among all discrete random variables with the same support.
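A short derivation sketch of why this holds, assuming a finite support with $|\mathcal{S}| = m$, an arbitrary distribution $p$ on $\mathcal{S}$, and the uniform distribution $u(s) = 1/m$. This is the standard argument via non-negativity of the Kullback-Leibler divergence and may differ from the proof given in the paper's appendix; the symbols $p$, $u$, and $m$ are introduced here for illustration.

```latex
\begin{align*}
0 \le D_{\mathrm{KL}}(p \,\|\, u)
  &= \sum_{s \in \mathcal{S}} p(s) \log \frac{p(s)}{1/m} \\
  &= \log m + \sum_{s \in \mathcal{S}} p(s) \log p(s) \\
  &= \log m - H(p).
\end{align*}
```

Hence $H(p) \le \log m = H(u)$, with equality if and only if $p = u$.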

Figures (4)

Figure 1: A Directed Acyclic Graph (DAG) representing the Markov constraint $T \leftrightarrow X \leftrightarrow Y$.
Figure 2: Violin/box plots of Adjusted Rand Index values by method.
Figure 3: Mean cluster recovery, measured by ARI, of the five methods under comparison across different experimental conditions.
Figure D.4: Mutual information curves against the number of clusters $C$ used to run DIBmix, for synthetic data sets with four well-separated spherical and moderately separated non-spherical clusters, respectively. The red points correspond to the knee points of each of the curves.

Theorems & Definitions (3)

Theorem A.1
Proposition B.1
Lemma C.1
