Flexible Bivariate Beta Mixture Model: A Probabilistic Approach for Clustering Complex Data Structures

Yung-Peng Hsu, Hung-Hsuan Chen

TL;DR

The paper addresses clustering of data with nonconvex and irregular structures by proposing the Flexible Bivariate Beta Mixture Model (FBBMM), a probabilistic mixture where each component uses a four-parameter flexible bivariate beta distribution. Parameter learning is performed via the EM algorithm, with an SLSQP optimizer for the cluster-specific shape parameters, enabling soft clustering and the modeling of both positive and negative correlations. Empirical results on synthetic nonconvex shapes and open datasets (wine, MNIST-derived features) show FBBMM outperforms traditional methods such as k-means, DBSCAN, GMM, and MBMM, highlighting its capacity to capture complex data geometries. The approach provides a practical, generative framework for clustering complex data and offers avenues for extension to higher dimensions and more robust noise handling, with an open-source implementation available.

Abstract

Clustering is essential in data analysis and machine learning, but traditional algorithms like $k$-means and Gaussian Mixture Models (GMM) often fail with nonconvex clusters. To address this challenge, we introduce the Flexible Bivariate Beta Mixture Model (FBBMM), which utilizes the flexibility of the bivariate beta distribution to handle diverse and irregular cluster shapes. Parameters are estimated with the Expectation Maximization (EM) algorithm and the Sequential Least Squares Programming (SLSQP) optimizer. We validate FBBMM on synthetic and real-world datasets, where it outperforms the compared methods in clustering complex data structures, offering a robust solution for big data analytics across various domains. We release the experimental code at https://github.com/yung-peng/MBMM-and-FBBMM.
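
The EM-plus-SLSQP loop described above can be illustrated with a minimal sketch. This is not the authors' implementation: the density of the flexible bivariate beta distribution is not given on this page, so the sketch substitutes a stand-in component (a product of two independent Beta densities, which cannot model correlation). The function names `log_pdf`, `e_step`, `m_step`, and `fit`, the initial values, and the synthetic data are all assumptions made for illustration. Only the overall pattern follows the description: soft assignments in the E-step, and a numerical SLSQP search over each cluster's four shape parameters in the M-step.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import logsumexp
from scipy.stats import beta


def log_pdf(X, params):
    """Stand-in component (assumption): product of two independent Beta densities.

    params = (a1, b1, a2, b2). This is NOT the flexible bivariate beta of the paper.
    """
    a1, b1, a2, b2 = params
    return beta.logpdf(X[:, 0], a1, b1) + beta.logpdf(X[:, 1], a2, b2)


def e_step(X, weights, params):
    # log(pi_k * p(x_i | theta_k)) for every point i and component k
    log_r = np.column_stack(
        [np.log(w) + log_pdf(X, p) for w, p in zip(weights, params)]
    )
    log_norm = logsumexp(log_r, axis=1, keepdims=True)
    # responsibilities (soft assignments) and total log-likelihood
    return np.exp(log_r - log_norm), float(log_norm.sum())


def m_step(X, resp, params):
    new_params = []
    for k, p0 in enumerate(params):
        # responsibility-weighted negative log-likelihood of component k
        def neg_wll(p, r=resp[:, k]):
            return -np.sum(r * log_pdf(X, p))

        # SLSQP with lower bounds keeps every shape parameter positive
        res = minimize(neg_wll, p0, method="SLSQP", bounds=[(1e-3, None)] * 4)
        new_params.append(res.x)
    # mixing weights have a closed form: the mean responsibility per component
    return resp.mean(axis=0), new_params


def fit(X, params, n_iter=20):
    weights = np.full(len(params), 1.0 / len(params))
    history = []
    for _ in range(n_iter):
        resp, ll = e_step(X, weights, params)
        weights, params = m_step(X, resp, params)
        history.append(ll)
    return weights, params, resp, history


# Synthetic data on the unit square (illustrative only, not from the paper).
rng = np.random.default_rng(0)
low = rng.beta(2.0, 8.0, size=(200, 2))   # cluster near the lower-left corner
high = rng.beta(8.0, 2.0, size=(200, 2))  # cluster near the upper-right corner
X = np.clip(np.vstack([low, high]), 1e-6, 1 - 1e-6)
labels_true = np.repeat([0, 1], 200)

init = [np.array([1.0, 2.0, 1.0, 2.0]), np.array([2.0, 1.0, 2.0, 1.0])]
weights, params, resp, history = fit(X, init)
labels_pred = resp.argmax(axis=1)  # hard labels from the soft assignments
```

Because each point receives a responsibility for every component, the model yields soft clusters; hard labels are obtained only at the end by taking the most responsible component. Replacing `log_pdf` with the four-parameter flexible bivariate beta density would leave the rest of the loop unchanged.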

Paper Structure

This paper contains 13 sections, 14 equations, 3 figures, 5 tables, 1 algorithm.

Figures (3)

  • Figure 1: The PDF plots of the bivariate beta distribution with different parameters. The top row: $\bm{\alpha}=(3,3,3,3)$; $\bm{\alpha}=(1,1,1,1)$; $\bm{\alpha}=(0.8,0.8,0.8,0.8)$. The middle row: $\bm{\alpha}=(2,4,2,2)$; $\bm{\alpha}=(4,2,2,2)$; $\bm{\alpha}=(4,2,4,0.5)$. The bottom row: $\bm{\alpha}=(2,2,2,0)$; $\bm{\alpha}=(1,1,1,0.5)$; $\bm{\alpha}=(0.5,1,1,1)$. The shapes can be nonconvex (e.g., the upper right subfigure). The covariates can be positively correlated (e.g., the middle center subfigure), negatively correlated (e.g., the lower left subfigure), or uncorrelated (e.g., the upper middle subfigure).
  • Figure 2: The plate notation of the flexible bivariate beta mixture model
  • Figure 3: Clustering results on synthetic datasets