AdaBox: Adaptive Density-Based Box Clustering with Parameter Generalization

Ahmed Elmahdi

Abstract

Density-based clustering algorithms like DBSCAN and HDBSCAN are foundational tools for discovering arbitrarily shaped clusters, yet their practical utility is undermined by acute hyperparameter sensitivity: parameters tuned on one dataset frequently fail to transfer to others, requiring expensive re-optimization for each deployment. We introduce AdaBox (Adaptive Density-Based Box Clustering), a grid-based density clustering algorithm designed for robustness across diverse data geometries. AdaBox has six parameters, each of which captures cluster structure rather than pairwise point relationships. Four parameters are inherently scale-invariant, one self-corrects for sampling bias, and one is adjusted via a density scaling stage, enabling reliable parameter transfer across 30-200x scale factors. AdaBox processes data through five stages: adaptive grid construction, liberal seed initialization, iterative growth with graduation, statistical cluster merging, and Gaussian boundary refinement. Comprehensive evaluation across 111 datasets yields three key findings: (1) AdaBox significantly outperforms DBSCAN and HDBSCAN across five evaluation metrics, achieving the best score on 78% of datasets with p < 0.05; (2) AdaBox uniquely exhibits parameter generalization: under Protocol A (direct transfer to 30-100x larger datasets), AdaBox maintains performance while the baselines collapse; and (3) ablation studies confirm that all five architectural stages are necessary for maintaining robustness.
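To make the idea of a scale-invariant parameter concrete, the following is a minimal sketch of a generic grid-based density clustering routine. It is not the AdaBox algorithm and implements none of its five stages; the function name `grid_density_cluster` and the parameters `cells_per_axis` and `min_points` are hypothetical, chosen only for this illustration. The point it shows is that a resolution expressed relative to the data extent (cells per axis) gives the same labels when the data are rescaled, whereas an absolute distance threshold would need retuning.

```python
from collections import deque


def grid_density_cluster(points, cells_per_axis=8, min_points=2):
    """Toy grid-based density clustering for 2D points.

    Assumption: `cells_per_axis` and `min_points` are illustrative
    parameters for this sketch, not AdaBox's six parameters.
    Returns one label per point; -1 marks points in sparse cells.
    """
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    x_min, y_min = min(xs), min(ys)
    # Cell size is derived from the data range, so it scales with the data.
    x_span = (max(xs) - x_min) or 1.0
    y_span = (max(ys) - y_min) or 1.0

    def cell_of(p):
        cx = int((p[0] - x_min) / x_span * cells_per_axis)
        cy = int((p[1] - y_min) / y_span * cells_per_axis)
        # Clamp points on the upper boundary into the last cell.
        return min(cx, cells_per_axis - 1), min(cy, cells_per_axis - 1)

    # Count points per cell and keep the cells that are dense enough.
    counts = {}
    for p in points:
        c = cell_of(p)
        counts[c] = counts.get(c, 0) + 1
    dense = {c for c, n in counts.items() if n >= min_points}

    # Flood-fill over 8-connected dense cells to form clusters.
    cell_label = {}
    next_label = 0
    for start in sorted(dense):
        if start in cell_label:
            continue
        cell_label[start] = next_label
        queue = deque([start])
        while queue:
            cx, cy = queue.popleft()
            for dx in (-1, 0, 1):
                for dy in (-1, 0, 1):
                    nb = (cx + dx, cy + dy)
                    if nb in dense and nb not in cell_label:
                        cell_label[nb] = next_label
                        queue.append(nb)
        next_label += 1

    return [cell_label.get(cell_of(p), -1) for p in points]


# Two small blobs plus one isolated point.
points = [
    (0.0, 0.0), (0.1, 0.1), (0.2, 0.0), (0.0, 0.2),
    (10.0, 10.0), (9.9, 9.8), (9.8, 10.0), (10.0, 9.9),
    (5.3, 4.6),
]
labels = grid_density_cluster(points)

# The same data at 100x scale, clustered with the same parameters.
scaled_points = [(100 * x, 100 * y) for x, y in points]
scaled_labels = grid_density_cluster(scaled_points)
```

In this toy setting `labels` and `scaled_labels` agree because no parameter is an absolute distance. How AdaBox achieves the scale transfer described in the abstract (including its density scaling stage) is not specified on this page and is not modeled here.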

Paper Structure (23 sections, 7 figures, 6 tables)

Figures (7)

  • Figure 1: AdaBox Algorithm Pipeline. Five-stage processing pipeline showing: Adaptive Grid Construction $\rightarrow$ Liberal Seed Initialization $\rightarrow$ Iterative Growth with Graduation $\rightarrow$ Statistical Cluster Merging $\rightarrow$ Gaussian Boundary Refinement.
  • Figure 2: AdaBox achieves the highest average score on all five metrics.
  • Figure 3: Visual clustering comparison on the TrafficFlow_Urban dataset showing AdaBox (ARI: 0.70) vs. DBSCAN (0.15) vs. HDBSCAN (0.53).
  • Figure 4: Clustering comparison on the Pendigits benchmark. The Pendigits dataset (16D original, reduced to 2D via PCA) presents a challenging 10-class clustering problem with overlapping cluster boundaries. (a) Ground truth labels. (b) AdaBox achieves ARI = 0.431. (c) DBSCAN with optimized ARI. (d) HDBSCAN with ARI = 0.246.
  • Figure 5: AdaBox's advantage grows on challenging datasets. Win rates across Groups 7–9 (15 large-scale real-world datasets) show AdaBox achieving 80–100% on both ARI and SCOPE, while baselines cluster near 10–40%, a wider margin than observed on synthetic benchmarks.
  • ...and 2 more figures