Angular Constraint Embedding via SpherePair Loss for Constrained Clustering

Shaojie Zhang, Ke Chen

TL;DR

SpherePair introduces an anchor-free, angular constraint embedding for constrained clustering that learns angular representations balancing pairwise constraints in a bounded space. The approach decouples representation learning from clustering by optimizing an angular loss with a reconstruction term, yielding spherical embeddings where positive pairs cluster together and negative pairs occupy a defined negative zone. Theoretical results establish the conditions for a conflict-free embedding, the required embedding dimension $D$ relative to the true cluster count $K$, and a PCA-based method to infer $K$ without retraining. Empirically, SpherePair outperforms state-of-the-art DCC baselines across diverse datasets, handles unknown cluster numbers, and exhibits robustness to constraint imbalance, with practical guidance for hyperparameters. These properties make SpherePair particularly suitable for scalable, real-world constrained clustering tasks where the true number of clusters is not known a priori.
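
The behaviour described above (small angles for positive pairs, negative pairs pushed into a zone defined through $\frac{\pi}{\omega}$) can be illustrated with a toy objective. The sketch below is not the SpherePair loss itself, whose exact form is not given in this summary. It is a hinge-style stand-in that assumes the negative zone consists of angles of at least $\frac{\pi}{\omega}$, and it omits the reconstruction term. The names `pairwise_angles` and `toy_angular_loss` are hypothetical.

```python
import numpy as np

def pairwise_angles(z_a, z_b):
    """Angle in radians between corresponding rows of z_a and z_b."""
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)
    cos = np.clip(np.sum(z_a * z_b, axis=1), -1.0, 1.0)
    return np.arccos(cos)

def toy_angular_loss(z_a, z_b, is_positive, omega=2.0):
    """Illustrative stand-in, NOT the loss defined in the paper.

    Positive pairs are penalised by their squared angle.
    Negative pairs are penalised only while their angle is below
    pi / omega (the assumed lower edge of the negative zone).
    """
    theta = pairwise_angles(z_a, z_b)
    pos_term = theta ** 2
    neg_term = np.maximum(0.0, np.pi / omega - theta) ** 2
    return float(np.mean(np.where(is_positive, pos_term, neg_term)))

# One positive pair (parallel vectors) and one negative pair (orthogonal).
z_a = np.array([[1.0, 0.0], [1.0, 0.0]])
z_b = np.array([[2.0, 0.0], [0.0, 3.0]])
labels = np.array([True, False])
loss = toy_angular_loss(z_a, z_b, labels, omega=2.0)
```

With `omega=2.0`, the orthogonal negative pair already sits at the assumed zone boundary and the positive pair has zero angle, so both terms vanish in this toy setting. Because only angles enter the objective, the vector norms play no role, which is why the embedding can be treated as living on a sphere.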

Abstract

Constrained clustering integrates domain knowledge through pairwise constraints. However, existing deep constrained clustering (DCC) methods are either limited by anchors inherent in end-to-end modeling or struggle to learn discriminative Euclidean embeddings, restricting their scalability and real-world applicability. To avoid their respective pitfalls, we propose a novel angular constraint embedding approach for DCC, termed SpherePair. Using the SpherePair loss with a geometric formulation, our method faithfully encodes pairwise constraints and leads to embeddings that are clustering-friendly in angular space, effectively separating representation learning from clustering. SpherePair preserves pairwise relations without conflict, removes the need to specify the exact number of clusters, generalizes to unseen data, enables rapid inference of the number of clusters, and is supported by rigorous theoretical guarantees. Comparative evaluations with state-of-the-art DCC methods on diverse benchmarks, along with empirical validation of theoretical insights, confirm its superior performance, scalability, and overall real-world effectiveness. Code is available at https://github.com/spherepaircc/SpherePairCC/tree/main.

Paper Structure

This paper contains 95 sections, 7 theorems, 55 equations, 23 figures, 6 tables, 2 algorithms.

Key Result

Proposition 4.1

Let $\mathcal{S}^* = \{\mathcal{S}_k^*\}_{k=1}^K$ be the ground-truth partition of $\mathcal{X} = \{\boldsymbol{x}_j\}_{j=1}^{|\mathcal{X}|}$. An optimal angular representation $\mathcal{Z}^* = \{\boldsymbol{z}_j^*\}_{j=1}^{|\mathcal{X}|} \subset \mathbb{R}^D$ achieves $\mathcal{L}_{\mathrm{ang}}=0$

Figures (23)

  • Figure 1: Different pairwise learning approaches. End-to-end DCC introduces anchors to transform features in (a) into soft cluster assignments in (b) for pairwise losses. Deep constraint embedding in (c) focuses on the Euclidean distances between features, while ours in (d) operates in angular space.
  • Figure 2: Evolution of $\mathcal{Z}$ during SpherePair embedding learning (from left to right): the angular distances of positive pairs decrease, while those of negative pairs gradually adhere to the negative zone $\frac{\pi}{\omega}$.
  • Figure 3: Test ACC performance (mean$\pm$std over 5 runs) of all models across datasets under the balanced vs. imbalanced constraints setting where ($|\text{IMB0}|$, $|\text{IMB1}|$, $|\text{IMB2}|$) = (10k, 50k, 100k).
  • Figure 4: t-SNE visualizations of learned FMNIST embeddings under the IMB2 setting in Fig. 3. Marker colors denote ground-truth categories, and dashed lines represent pairwise constraints. The red circles highlight the misclustered instances.
  • Figure 5: Tail-averaged minimal inter-cluster angle $\overline{\delta}_d$ vs. PCA subspace dimension $d$, obtained from SpherePair embeddings learned with 10k constraints across 5 runs. The red lines indicate the ground-truth intrinsic dimensions $d^\ast = K{-}1$.
  • ...and 18 more figures
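
The caption of Figure 5 marks the intrinsic dimension as $d^\ast = K{-}1$. One way to see why $K-1$ is a natural value is the following sketch, which assumes (for illustration only) that the $K$ cluster directions of an idealized embedding form a regular simplex, i.e. all pairwise angles are equal. It is not the paper's PCA-based procedure for inferring $K$, and the helper names are hypothetical.

```python
import numpy as np

def simplex_directions(k):
    """K unit vectors in R^K with equal pairwise angles (regular simplex)."""
    centered = np.eye(k) - 1.0 / k  # standard basis vectors minus their centroid
    return centered / np.linalg.norm(centered, axis=1, keepdims=True)

def effective_dimension(points, tol=1e-8):
    """Count singular values that are non-negligible relative to the largest."""
    s = np.linalg.svd(points, compute_uv=False)
    return int(np.sum(s > tol * s[0]))

def min_pairwise_angle(points):
    """Smallest angle (radians) between two distinct unit-norm rows."""
    cos = np.clip(points @ points.T, -1.0, 1.0)
    np.fill_diagonal(cos, -1.0)  # exclude self-pairs from the maximum
    return float(np.arccos(cos.max()))

K = 10
directions = simplex_directions(K)
dim = effective_dimension(directions)   # number of dimensions actually spanned
angle = min_pairwise_angle(directions)  # common inter-direction angle
```

Under this assumption the $K$ directions span exactly $K-1$ dimensions and every pair is separated by $\arccos\left(-\frac{1}{K-1}\right)$, so projecting onto fewer than $K-1$ principal directions must shrink at least some inter-cluster angles. This is consistent with the role of $d^\ast = K{-}1$ in the caption, though the tail-averaged statistic $\overline{\delta}_d$ itself is not reproduced here.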

Theorems & Definitions (14)

  • Proposition 4.1: Conflict-free
  • Proposition 4.2: Equidistance
  • Corollary 4.3: Geometric Deviations under Near-zero Residual Loss
  • Theorem 4.4: Existence of Valid $\omega$
  • Corollary 4.5: Minimal Admissible $\omega$
  • Theorem 4.6: Pairwise-angle Invariance
  • Corollary 4.7: $\delta_d$ Invariance
  • proof
  • proof
  • proof
  • ...and 4 more