Optimization of Inter-group Criteria for Clustering with Minimum Size Constraints

Eduardo S. Laber; Lucas Murtinho

TL;DR

This work studies clustering under two inter-group quality measures, Min-Sp and MST-Sp, and adds a minimum cluster size constraint $L$ to avoid tiny groups. It shows that single-linkage yields a clustering with optimal MST-Sp and, consequently, optimal Min-Sp, while noting the practical limitations caused by chaining. For the size-constrained setting, the authors propose algorithms with provable guarantees: a PTAS for maximizing Min-Sp over $(k,L)$-clusterings and a $(\rho(1-\epsilon)/2, 1/H_{k-1})$-approximation for MST-Sp, where $\rho = \min\{n/(kL),2\}$, along with APX-hardness results. Experiments on 10 real datasets show better inter-group separability than $k$-means and single-linkage while keeping cluster sizes reasonable; the work also discusses practical considerations and possible extensions to other inter-group criteria.
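To get a feel for the two factors in the MST-Sp guarantee, the sketch below evaluates them with exact fractions. It assumes that $H_{k-1}$ denotes the $(k-1)$-th harmonic number, which this summary does not define; the function names and the sample values of $n$, $k$, $L$ and $\epsilon$ are illustrative only. The role each factor plays in the guarantee is not spelled out here, so the code only computes the numbers.

```python
from fractions import Fraction


def rho(n, k, L):
    """rho = min{n / (k * L), 2}, as given in the summary."""
    return min(Fraction(n, k * L), Fraction(2))


def harmonic(m):
    """m-th harmonic number 1 + 1/2 + ... + 1/m (assumed meaning of H_m)."""
    return sum(Fraction(1, i) for i in range(1, m + 1))


def guarantee_factors(n, k, L, eps):
    """Return the pair (rho * (1 - eps) / 2, 1 / H_{k-1})."""
    first = rho(n, k, L) * (1 - Fraction(eps)) / 2
    second = 1 / harmonic(k - 1)
    return first, second


# Illustrative values: 1200 points, 4 groups, at least 200 points per group.
first, second = guarantee_factors(n=1200, k=4, L=200, eps=Fraction(1, 10))
print(first, second)
```

With these sample values, $n/(kL) = 3/2$, so $\rho = 3/2$; once $n \ge 2kL$ the cap of 2 applies instead.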

Abstract

Internal measures that are used to assess the quality of a clustering usually take into account intra-group and/or inter-group criteria. There are many papers in the literature that propose algorithms with provable approximation guarantees for optimizing the former. However, the optimization of inter-group criteria is much less understood. Here, we contribute to the state-of-the-art of this literature by devising algorithms with provable guarantees for the maximization of two natural inter-group criteria, namely the minimum spacing and the minimum spanning tree spacing. The former is the minimum distance between points in different groups while the latter captures separability through the cost of the minimum spanning tree that connects all groups. We obtain results for both the unrestricted case, in which no constraint on the clusters is imposed, and for the constrained case where each group is required to have a minimum number of points. Our constraint is motivated by the fact that the popular Single Linkage, which optimizes both criteria in the unrestricted case, produces clusterings with many tiny groups. To complement our work, we present an empirical study with 10 real datasets, providing evidence that our methods work very well in practical settings.
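The two criteria can be made concrete with a small sketch. It assumes Euclidean distance and reads MST-Sp as the cost of a minimum spanning tree on the graph whose nodes are the groups and whose edge weights are the pairwise spacings; the function names and the toy data are illustrative and do not come from the paper.

```python
from itertools import combinations
from math import dist, inf


def spacing(a, b):
    """Smallest distance between a point of group a and a point of group b."""
    return min(dist(p, q) for p in a for q in b)


def min_sp(groups):
    """Min-Sp: minimum spacing over all pairs of distinct groups."""
    return min(spacing(a, b) for a, b in combinations(groups, 2))


def mst_sp(groups):
    """MST-Sp (assumed reading): cost of an MST over groups, weighted by spacing."""
    k = len(groups)
    best = [inf] * k       # cheapest known edge from the tree to each group
    best[0] = 0.0          # start Prim's algorithm from group 0
    in_tree = [False] * k
    total = 0.0
    for _ in range(k):
        # Pick the group outside the tree that is cheapest to attach.
        u = min((i for i in range(k) if not in_tree[i]), key=best.__getitem__)
        in_tree[u] = True
        total += best[u]
        for v in range(k):
            if not in_tree[v]:
                best[v] = min(best[v], spacing(groups[u], groups[v]))
    return total


# Toy 3-clustering of five points in the plane.
groups = [
    [(0.0, 0.0), (1.0, 0.0)],
    [(4.0, 0.0), (5.0, 0.0)],
    [(5.0, 6.0)],
]
print(min_sp(groups), mst_sp(groups))
```

In this toy example the pairwise spacings are 3, 6 and about 7.21, so Min-Sp picks the smallest one while MST-Sp adds the two cheapest edges that connect all three groups.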

Paper Structure (22 sections, 15 theorems, 9 equations, 5 figures, 7 tables, 2 algorithms)

Introduction
Preliminaries
Relating Min-Sp and MST-Sp criteria
Avoiding small groups
The Min-Sp criterion
The MST-Sp criterion
Experiments
Final Remarks
Properties of Minimum Spanning Trees
Proof of Lemma \ref{lemm:aux-single-link}
Proofs of Section \ref{sec:small-groups}
Proof of Lemma \ref{lem:uppbound09May}
Proof of Theorem \ref{thm:complexity}
Proof of Theorem \ref{thm:complexity2}
Experiments: Additional Information
...and 7 more sections

Key Result

Theorem 2.1

The single-linkage algorithm obtains the $k$-clustering with maximum Min-Sp for instance $(\mathcal{X},{\tt dist})$.
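A minimal sketch of this result on a toy instance, assuming Euclidean distance: single-linkage is implemented Kruskal-style (repeatedly merging the two closest clusters until $k$ remain), and its Min-Sp is compared against an exhaustive search over all $k$-clusterings. The function names and the data are illustrative, and the brute-force check is only feasible for very small inputs.

```python
from itertools import combinations, product
from math import dist


def single_linkage(points, k):
    """Merge the two closest clusters until k remain; return a label per point."""
    parent = list(range(len(points)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # Process point pairs by increasing distance, as in Kruskal's algorithm.
    edges = sorted(combinations(range(len(points)), 2),
                   key=lambda e: dist(points[e[0]], points[e[1]]))
    clusters = len(points)
    for i, j in edges:
        if clusters == k:
            break
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            clusters -= 1
    return [find(i) for i in range(len(points))]


def min_sp(points, labels):
    """Minimum distance between two points that carry different labels."""
    return min(dist(points[i], points[j])
               for i, j in combinations(range(len(points)), 2)
               if labels[i] != labels[j])


def best_min_sp_brute_force(points, k):
    """Largest Min-Sp over every labeling that uses exactly k groups."""
    best = 0.0
    for labels in product(range(k), repeat=len(points)):
        if len(set(labels)) == k:
            best = max(best, min_sp(points, labels))
    return best


points = [(0.0,), (1.0,), (2.0,), (10.0,), (11.0,), (30.0,)]
labels = single_linkage(points, 3)
print(min_sp(points, labels), best_min_sp_brute_force(points, 3))
```

On this instance single-linkage returns the groups {0, 1, 2}, {10, 11} and {30}. The last group is a singleton, which is the kind of tiny cluster that motivates the minimum size constraint.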

Figures (5)

Figure 1: Partitions with 3 groups (defined by colors) that maximize the minimum spacing. The rightmost one is built by single-linkage, but both of them maximize the minimum spacing -- showing that this condition alone is insufficient to properly characterize single-linkage's behavior.
Figure 2: Proportion of singletons for each dataset with the growth of $k$
Figure 3: Boxplots of the Min-Sp per dataset and algorithm.
Figure 4: Boxplots of the MST-Sp per dataset and algorithm.
Figure 5: Trade-off between the size of the smallest cluster and the separability criteria.

Theorems & Definitions (26)

Theorem 2.1: DBLP:books/daglib/0015106, chap 4.7
Lemma 3.1
Theorem 3.2
proof
Theorem 3.3
proof
Example 3.4
Theorem 4.1
proof
Theorem 4.2
...and 16 more
