Discriminative Entropy Clustering and its Relation to K-means and SVM

Zhongwen Zhang, Yuri Boykov

TL;DR

The paper proves the margin-maximizing property of decisiveness, which relates entropy clustering to SVM-based clustering, and proposes a new self-labeling formulation of entropy clustering for general softmax models.

Abstract

Maximization of mutual information between the model's input and output is formally related to "decisiveness" and "fairness" of the softmax predictions, motivating these unsupervised entropy-based criteria for clustering. First, in the context of linear softmax models, we discuss some general properties of entropy-based clustering. Disproving some earlier claims, we point out fundamental differences with K-means. On the other hand, we prove the margin maximizing property for decisiveness establishing a relation to SVM-based clustering. Second, we propose a new self-labeling formulation of entropy clustering for general softmax models. The pseudo-labels are introduced as auxiliary variables "splitting" the fairness and decisiveness. The derived self-labeling loss includes the reverse cross-entropy robust to pseudo-label errors and allows an efficient EM solver for pseudo-labels. Our algorithm improves the state of the art on several standard benchmarks for deep clustering.
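
The decisiveness and fairness terms mentioned in the abstract can be made concrete with a small numerical sketch. The code below is only an illustration under stated assumptions, not the authors' implementation: it uses a linear softmax model with a bias column and evaluates the loss in the form given in the Figure 1 caption, $\overline{H(\sigma)} - H(\bar{\sigma})$, with natural logarithms. The function names and the `eps` smoothing constant are hypothetical.

```python
import numpy as np

def softmax(logits):
    # Row-wise softmax with max-subtraction for numerical stability.
    z = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def entropy(p, eps=1e-12):
    # Shannon entropy (natural log) along the last axis; eps is an
    # assumed smoothing constant that avoids log(0).
    return -np.sum(p * np.log(p + eps), axis=-1)

def entropy_clustering_loss(X, v):
    """X: (N, D) data; v: (D+1, K) linear discriminants, last row is the bias."""
    X1 = np.hstack([X, np.ones((len(X), 1))])   # append constant 1 for the bias
    sigma = softmax(X1 @ v)                      # (N, K) soft predictions
    decisiveness = entropy(sigma).mean()         # average entropy of predictions
    fairness = entropy(sigma.mean(axis=0))       # entropy of the average prediction
    return decisiveness - fairness               # negative mutual information

if __name__ == "__main__":
    X = np.array([[-1.0], [1.0]])
    confident_balanced = np.array([[-50.0, 50.0], [0.0, 0.0]])
    confident_one_cluster = np.array([[0.0, 0.0], [50.0, -50.0]])
    print(entropy_clustering_loss(X, confident_balanced))
    print(entropy_clustering_loss(X, confident_one_cluster))
```

In this toy setup, confident and balanced predictions give a loss close to $-\log 2$, while assigning both points confidently to one cluster gives a loss close to zero: decisive, but not fair.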

Paper Structure (26 sections, 4 theorems, 72 equations, 12 figures, 6 tables, 1 algorithm)

Key Result

Theorem 1

Consider any given linearly separable labeling ${\bf y}:=\{y_i\}_{i=1}^N$ for dataset $\{X_i\}_{i=1}^N$. Assume that the linear classifier ${\bf v}(\gamma)$ minimizes the regularized logistic loss over classifiers ${\bf v}\in{\bf V}^{\bf y}$ consistent with the given (e.g. ground truth) labeling ${\bf y}$. Then
$$\frac{{\bf v}(\gamma)}{\|{\bf v}(\gamma)\|} \;\overset{\gamma\rightarrow 0}{\longrightarrow}\; {\bf u}^{\bf y}$$
where ${\bf u}^{\bf y}$ is a unit-norm max-margin linear classifier for ${\bf y}$.
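
The max-margin behaviour that Theorem 1 describes for regularized logistic regression can be checked numerically on a toy problem. The sketch below is an illustration under several assumptions that are not taken from the paper: the loss is written as $\gamma\|{\bf v}\|^2 + \sum_i \log(1+e^{-y_i {\bf v}^\top X_i})$, the classifier has no bias term, labels are in $\{-1,+1\}$, and the minimizer is found with a damped Newton method. The dataset and all function names are hypothetical. For this dataset the unit-norm max-margin direction is $(1, 0)$.

```python
import numpy as np

def fit_regularized_logistic(X, y, gamma, iters=200):
    """Minimize gamma*||v||^2 + sum_i log(1 + exp(-y_i v.x_i)) by damped Newton."""
    Z = X * y[:, None]                 # fold labels y_i in {-1,+1} into the data
    v = np.zeros(X.shape[1])

    def loss(w):
        return gamma * (w @ w) + np.sum(np.logaddexp(0.0, -Z @ w))

    for _ in range(iters):
        s = 0.5 * (1.0 - np.tanh(0.5 * (Z @ v)))   # stable sigmoid(-margin)
        grad = 2.0 * gamma * v - Z.T @ s
        if np.linalg.norm(grad) < 1e-12:
            break
        hess = 2.0 * gamma * np.eye(len(v)) + (Z.T * (s * (1.0 - s))) @ Z
        step = np.linalg.solve(hess, grad)          # Newton direction
        t = 1.0
        # Backtracking (Armijo) line search keeps the loss decreasing.
        while loss(v - t * step) > loss(v) - 1e-4 * t * (grad @ step) and t > 1e-8:
            t *= 0.5
        v = v - t * step
    return v

def cosine_to(v, u):
    # Cosine of the angle between the fitted direction and a reference direction.
    return float(v @ u / (np.linalg.norm(v) * np.linalg.norm(u)))

# Toy separable data (hypothetical): the point (1, 0) and its mirror image
# determine the margin, so the max-margin direction is u = (1, 0).
X = np.array([[1.0, 0.0], [3.0, 3.0], [-1.0, 0.0], [-3.0, -3.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
u = np.array([1.0, 0.0])

if __name__ == "__main__":
    for gamma in [10.0, 1.0, 1e-2, 1e-4]:
        v = fit_regularized_logistic(X, y, gamma)
        print(gamma, np.linalg.norm(v), cosine_to(v, u))
```

With strong regularization the fitted direction leans toward the sum of the label-weighted points; as $\gamma$ shrinks, the norm of the solution grows and its direction aligns with the max-margin classifier.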

Figures (12)

  • Figure 1: Entropy clustering vs. K-means - binary example ($K=2$) for 2D data $\{X_i\}$ comparing two linear methods of similar parametric complexity: (a) $K$-means $\mu_k\in{\cal R}^2$ and (b) entropy clustering \ref{eq:mi} with a linear model \ref{eq:postmodel_shallow} defined by $K$-column matrix ${\bf v}=[{\bf v}_k]$ with linear discriminants ${\bf v}_k\in{\cal R}^{2+1}$ (incl. bias). Red and green colors in (a) and (b) illustrate decision/prediction functions, $k_\mu(X) :=\arg\min_k\, \|X-\mu_k\|$ and $\sigma_{\bf v}(X):=\sigma({\bf v}^\top X)$, corresponding to the optimal parameters $\mu$ and ${\bf v}$ minimizing two losses: (a) compactness or variance of clusters $\sum_{i} \|X_i-\mu_{k_i}\|^2$ where $k_i=k_\mu(X_i)$, and (b) decisiveness and fairness $\overline{H(\sigma)} - H ( \bar{\sigma} )$, see \ref{eq:mi}, where $\sigma_i = \sigma({\bf v}^\top X_i)$. Color transparency in (b) visualizes "soft" decisions $\sigma({\bf v}^\top X)$; the linear boundary "blur" is proportional to $\frac{1}{\|{\bf v}\|}$. Unlike low-variance (a), the optimal clusters in (b) have the maximum margin among all fair/balanced solutions, assuming "infinitesimal" norm regularization $\|{\bf v}\|^2$ discussed in Sec. \ref{sec:SVM}.
  • Figure 2: Model regularization is required: (a) arbitrarily complex model \ref{eq:postmodel_deep} can create any random clusters of the input $\{X_i\}$ despite the linear partitioning in the space of deep features $f_{\bf w}(X)$. Self-augmentation loss \ref{eq:self-augment loss} regularizes the clusters (b), as well as the mapping $f$. Indeed, similar inputs $\{X,X'\}$ are mapped to features equidistant from the decision boundary. Stronger forms of isometry can be enforced by contrastive losses \cite{chopra2005learning,schroff2015facenet,sohn2016improved,chen2020simple}, particularly if negative pairs are also available. In general, model regularization with domain-specific constraints or augmentation is important for the quality of clustering.
  • Figure 3: Global & local minima: linear regularized entropy clustering (rEC) versus soft K-means (sKM). For both losses, global optima (a) and (d) are consistent for all $\gamma\in(0,0.00001]$. Variations in the optimal loss values are negligible. sKM is nearly identical to hard K-means for such $\gamma$; it softens only for larger $\gamma$, see Figure \ref{fig:gamma}. The local minimum for sKM (c) is obtained by Lloyd's algorithm initialized at (a). Vice versa, gradient descent for rEC converges to (a) from (c). The same "cross-check" works for (b) and (d). Local minima for rEC (a,b) are balanced clusterings with (locally) maximum margins. In contrast, local minima for sKM (c,d) are orthogonal bisectors for the cluster centers. K-means ignores the margins.
  • Figure 4: Global minima for various $\gamma$: linear regularized entropy clustering (rEC) versus soft K-means (sKM). As $\gamma\rightarrow 0$, both approaches converge to hard, but different, clusters. Optimal/low-variance sKM clusters are consistent for all $\gamma$. In contrast, rEC produces max-margin clusters for small $\gamma$ and changes the solution for larger $\gamma_4$. The latter reduces norm $\|{\bf v}\|$ implying a wider "indecisiveness" zone around the linear decision boundary. Due to the decisiveness term in \ref{eq:mi}, entropy clustering finds the boundary minimizing the overlap between the data and such "softness" zone, explaining the result for $\gamma_4$.
  • Figure 5: Renyi entropy \cite{Renyi1961} of order $\alpha$: assuming $K=2$ the plots above show $R_\alpha({\sigma})$ in \ref{eq:binary_Ra_def} for binary distributions ${\sigma}=({\varsigma},1-{\varsigma})$ as functions of scalar ${\varsigma}\in[0,1]$. Shannon entropy \ref{eq:binary_H_def} is a special case corresponding to $\alpha=1$.
  • ...and 7 more figures
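
The Renyi entropy family from the Figure 5 caption can be sketched in a few lines. The code below assumes the standard textbook definition $R_\alpha(\sigma) = \frac{1}{1-\alpha}\ln\sum_k \sigma_k^\alpha$ with natural logarithms, which may differ in normalization from the paper's own equation; the function name is hypothetical. The cases $\alpha=1$ (Shannon entropy) and $\alpha=\infty$ (the $R_\infty$ of Theorem 2) are handled as limits.

```python
import numpy as np

def renyi_entropy(sigma, alpha):
    """Renyi entropy of order alpha for a discrete distribution (natural log)."""
    sigma = np.asarray(sigma, dtype=float)
    p = sigma[sigma > 0]                  # zero-probability outcomes contribute nothing
    if np.isinf(alpha):
        return float(-np.log(p.max()))    # limit alpha -> infinity (min-entropy)
    if np.isclose(alpha, 1.0):
        return float(-np.sum(p * np.log(p)))   # limit alpha -> 1 (Shannon entropy)
    return float(np.log(np.sum(p ** alpha)) / (1.0 - alpha))

if __name__ == "__main__":
    # Binary distributions (s, 1 - s), as in the Figure 5 caption.
    for s in [0.1, 0.3, 0.5]:
        print(s, [renyi_entropy([s, 1.0 - s], a) for a in (0.5, 1.0, 2.0, np.inf)])
```

For a fixed distribution the value does not increase with $\alpha$, and every order agrees on the two extremes: $\ln 2$ for the uniform binary distribution and zero for a one-hot distribution.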

Theorems & Definitions (11)

  • Theorem 1: max-margin for logistic regression \cite{rosset2003margin}
  • Definition 1
  • Definition 2
  • Theorem 2: max-margin clustering for $R_\infty$
  • proof
  • proof
  • Theorem 3: max-margin clustering for $R_\alpha$
  • proof
  • Definition 3
  • Definition 4
  • ...and 1 more