A Unifying Family of Data-Adaptive Partitioning Algorithms

Guy B. Oldaker, Maria Emelianenko

TL;DR

This work introduces a unifying, data-adaptive family of partitioning algorithms parameterized by $\alpha \in [0,1]$ that encompasses and extends classic clustering methods such as $k$-means and $k$-subspaces. Through a single objective $\mathcal{G}_{\alpha}$ and alternating minimization, the approach jointly optimizes Voronoi sets, orthogonal projectors, and centroids, with an adaptive mechanism that can adjust the number of clusters $k$ and total dimension $r$ based on data structure. The authors demonstrate versatility across subspace clustering, model order reduction, and matrix approximation, achieving automatic structure discovery and competitive or improved performance relative to established methods. The work suggests broad potential for cross-domain integration and motivates further exploration of parameter tuning, ensemble strategies, and connections to existing convergence theories. Overall, the framework offers a scalable, interpretable toolkit for high-dimensional data analysis with automatic adaptation capabilities.
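The alternating minimization described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the summary does not state $\mathcal{G}_{\alpha}$ explicitly, so the per-point cost $\alpha\|x - m_i\|^2 + (1-\alpha)\|(I - P_i)(x - m_i)\|^2$ used here is an assumption, chosen so that $\alpha = 1$ reduces to $k$-means and $\alpha = 0$ to $k$-subspaces. The function names, the fixed per-cluster dimension `dim`, and the random initialization are likewise hypothetical, and the adaptive mechanism for $k$ and $r$ is not included.

```python
import numpy as np

def point_costs(X, centroids, bases, alpha):
    """Cost of assigning each row of X to each cluster under the ASSUMED objective."""
    n, k = X.shape[0], len(bases)
    costs = np.empty((n, k))
    for i in range(k):
        R = X - centroids[i]                       # residuals to centroid m_i
        sq_norm = np.einsum("ij,ij->i", R, R)      # ||x - m_i||^2
        proj = R @ bases[i]                        # coordinates in subspace i
        sq_proj = np.einsum("ij,ij->i", proj, proj)
        # alpha*||r||^2 + (1-alpha)*||(I-P)r||^2 = ||r||^2 - (1-alpha)*||U^T r||^2
        costs[:, i] = sq_norm - (1.0 - alpha) * sq_proj
    return costs

def alternating_minimization(X, k, dim, alpha, n_iter=50, seed=0):
    """Alternate between updating (centroids, projectors) and Voronoi sets."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    labels = rng.integers(k, size=n)               # random initial partition
    centroids = np.zeros((k, d))
    bases = [np.zeros((d, dim)) for _ in range(k)]
    history = []                                   # energy after each assignment step
    for _ in range(n_iter):
        for i in range(k):
            members = X[labels == i]
            if len(members) == 0:
                continue                           # keep old parameters for an empty set
            centroids[i] = members.mean(axis=0)
            # Orthonormal basis: leading right singular vectors of the centered members.
            _, _, Vt = np.linalg.svd(members - centroids[i], full_matrices=False)
            U = np.zeros((d, dim))
            r = min(dim, Vt.shape[0])
            U[:, :r] = Vt[:r].T
            bases[i] = U
        costs = point_costs(X, centroids, bases, alpha)
        new_labels = costs.argmin(axis=1)          # Voronoi (nearest-model) assignment
        history.append(costs[np.arange(n), new_labels].sum())
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels, centroids, bases, history
```

Under this assumed cost, each of the two steps cannot increase the energy, so `history` is non-increasing up to floating-point error. As with $k$-means, the result depends on the initial partition.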

Abstract

Clustering algorithms remain valuable tools for grouping and summarizing the most important aspects of data. Application areas include image segmentation, dimension reduction, signal analysis, model order reduction, and numerical analysis. As a consequence, many clustering approaches have been developed to satisfy the unique needs of each particular field. In this article, we present a family of data-adaptive partitioning algorithms that unifies several well-known methods (e.g., k-means and k-subspaces). Indexed by a single parameter and employing a common minimization strategy, the algorithms are easy to use and interpret, and scale well to large, high-dimensional problems. In addition, we develop an adaptive mechanism that (a) exhibits skill at automatically uncovering data structures and problem parameters without any expert knowledge, and (b) can be used to augment other existing methods. By demonstrating the performance of our methods on examples from disparate fields including subspace clustering, model order reduction, and matrix approximation, we hope to highlight their versatility and potential for extending the boundaries of existing scientific domains. We believe our family's parametrized structure represents a synergy of algorithms that will foster new developments and directions, not least within the data science community.

Paper Structure

This paper contains 12 sections, 1 theorem, 45 equations, 4 figures.

Key Result

Theorem 1

Let $A \in \mathbb{R}^{m \times n}$ with $\operatorname{rank}(A) = \rho$, and let $0 < r \le \rho$ and $0 < k < n$ be integers. If $C \in \mathbb{R}^{m \times r}$ is the output of a column subset selection problem (CSSP) routine paired with an adaptive or non-adaptive variant from the family defined by $\mathcal{G}_\alpha$ with $\alpha = 0$ and $m_i = 0$, then […] where $\mathcal{G}^*$ is the energy value of either of these variants at completion.
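For readers unfamiliar with the setting, a CSSP routine selects $r$ columns of $A$ to form $C$, and the quality of the selection is commonly measured by the projection residual $\|A - CC^{+}A\|_F$. The sketch below evaluates that residual for a simple greedy selection. The greedy pivoting rule and both function names are illustrative assumptions, not the routine used in the paper, and the code makes no claim about the bound in Theorem 1.

```python
import numpy as np

def greedy_column_subset(A, r):
    """Greedily pick up to r column indices of A (illustrative heuristic only)."""
    R = np.array(A, dtype=float)                  # residual after removing chosen directions
    chosen = []
    for _ in range(r):
        norms = np.einsum("ij,ij->j", R, R)       # squared residual norm of each column
        j = int(np.argmax(norms))
        if norms[j] <= 1e-14:
            break                                 # nothing left to select (r exceeds the rank)
        chosen.append(j)
        q = R[:, j] / np.sqrt(norms[j])
        R -= np.outer(q, q @ R)                   # project the chosen direction out
    return chosen

def cssp_residual(A, cols):
    """Frobenius-norm error of projecting A onto the span of the selected columns."""
    C = A[:, cols]
    return np.linalg.norm(A - C @ np.linalg.pinv(C) @ A, "fro")
```

Because the greedy subsets are nested, the residual returned by `cssp_residual` cannot increase as `r` grows, and it vanishes (up to rounding) once `r` reaches the rank of `A`.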

Figures (4)

  • Figure 1: Example performance of our algorithm on an idealized clustering task with (top) and without (bottom) adaptation as the indexing parameter $\alpha$ is varied. The data consist of five multivariate Gaussian point clouds in $\mathbb{R}^{6000}$, projected onto $\mathbb{R}^{250}$ via a Gaussian random embedding prior to clustering [dong2021simpler]. Both algorithms are initialized with $k = 10$ clusters and total dimension 250 (see \ref{section1} for algorithm input details). Note that for $\alpha \in \{0.25, 0.5, 0.75\}$ the adaptive variant uncovers the correct number of clusters and labeling, while the non-adaptive version struggles. The t-SNE algorithm [hinton2002stochastic, van2008visualizing] is used for visualization.
  • Figure 2: Subspace clustering result (colors) using the adaptive family with $\alpha = 0$, $k = 4$, and total dimension, $d_{total} = 7$. Note that the algorithm is able to uncover the correct clusters and dimensions despite being initialized differently.
  • Figure 3: Error comparison for POD and our adaptive family (labeled $G_{adapt}$) over time (Figure \ref{fig:morfigure1}) and at time $t = 2$ (Figure \ref{fig:morfigure2}). See text for algorithm settings.
  • Figure 4: Matrix approximation error results for various algorithms from our family. The data consist of the matrix $A \in \mathbb{R}^{60000 \times 785}$ containing MNIST training images. Images taken from [emelianenko2024optimality].
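
The Gaussian random embedding mentioned in the Figure 1 caption can be sketched as follows. This is a generic construction under stated assumptions: the $1/\sqrt{d}$ scaling, the function name, and the seed handling are illustrative choices, and the caption does not specify the exact variant the authors used.

```python
import numpy as np

def gaussian_embedding(X, target_dim, seed=0):
    """Project the rows of X to target_dim dimensions with an i.i.d. Gaussian matrix."""
    rng = np.random.default_rng(seed)
    # Scaling by 1/sqrt(target_dim) keeps squared norms unchanged in expectation.
    omega = rng.standard_normal((X.shape[1], target_dim)) / np.sqrt(target_dim)
    return X @ omega
```

For data in $\mathbb{R}^{6000}$, calling `gaussian_embedding(X, 250)` returns an array with 250 columns whose row norms approximate those of `X`.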
