Robust Clustering on High-Dimensional Data with Stochastic Quantization

Anton Kozyriev, Vladimir Norkin

TL;DR

The paper tackles scalable clustering in high-dimensional settings where traditional algorithms demand excessive memory and lack robust convergence guarantees. It introduces Stochastic Quantization (SQ), recasting the objective as $F(y)=\mathbb{E}_{\xi}[ f(y,\xi) ]$ with $f(y,\xi)=\min_{1\le k\le K} d(\xi,y_k)^r$ and solving via SGD updates $y_k^{t+1}= \Pi_Y(y_k^t - \rho_t g_k(\tilde{\xi}^t))$, while leveraging a Triplet Network to embed data into a latent space and mitigate the curse of dimensionality. Key contributions include local convergence guarantees for the non-smooth, non-convex SQ objective under stochastic-gradient theory, adaptive-learning-rate variants to speed convergence, and a semi-supervised MNIST demonstration showing strong performance with partial labeling. The work provides a scalable framework for high-dimensional clustering and semi-supervised learning with practical implications for large datasets and annotation-limited scenarios, and outlines avenues for extending to unsupervised settings and additional contrastive losses.
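As a rough illustration of the update rule quoted above, the following is a minimal NumPy sketch, not the authors' implementation. It assumes Euclidean distance for $d$, takes $g_k(\xi) = r\,\|\xi - y_k\|^{r-2}(y_k - \xi)$ for the nearest center and zero for all others, and uses the bounding box of the data as a stand-in for the constraint set $Y$. The function names, the step-size schedule `rho0 / (1 + decay * t)`, and all default values are hypothetical choices made only for this example.

```python
import numpy as np


def sq_step(centers, xi, rho, r=2.0, lower=None, upper=None):
    """One stochastic update of the nearest center for a single sample xi."""
    diffs = centers - xi                     # rows are y_k - xi
    dists = np.linalg.norm(diffs, axis=1)    # Euclidean d(xi, y_k), an assumption
    k = int(np.argmin(dists))                # index attaining min_k d(xi, y_k)^r
    updated = centers.copy()
    if dists[k] > 0.0:
        # Gradient of ||xi - y_k||^r with respect to y_k (zero for other centers).
        grad = r * dists[k] ** (r - 2.0) * diffs[k]
        updated[k] = centers[k] - rho * grad
    if lower is not None and upper is not None:
        # Projection onto a box; the box is an assumed stand-in for the set Y.
        updated[k] = np.clip(updated[k], lower, upper)
    return updated, k


def stochastic_quantization(data, n_centers, n_steps=2000, rho0=0.25,
                            decay=0.01, r=2.0, seed=0):
    """Run n_steps single-sample updates; only one sample is used per step."""
    rng = np.random.default_rng(seed)
    lower, upper = data.min(axis=0), data.max(axis=0)
    init = rng.choice(len(data), size=n_centers, replace=False)
    centers = data[init].astype(float)       # start from random data points
    for t in range(n_steps):
        xi = data[rng.integers(len(data))]   # draw one sample uniformly
        rho = rho0 / (1.0 + decay * t)       # hypothetical decaying step size
        centers, _ = sq_step(centers, xi, rho, r, lower, upper)
    return centers


def quantization_error(data, centers, r=2.0):
    """Empirical estimate of F(y): mean over samples of min_k d(xi, y_k)^r."""
    d = np.linalg.norm(data[:, None, :] - centers[None, :, :], axis=2)
    return float(np.mean(d.min(axis=1) ** r))
```

Because each step touches a single sample and a single center, the loop never needs the full dataset in memory at once; `data` could be replaced by any source that yields one sample per iteration.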

Abstract

This paper addresses the limitations of conventional vector quantization algorithms, particularly K-Means and its variant K-Means++, and investigates the Stochastic Quantization (SQ) algorithm as a scalable alternative for high-dimensional unsupervised and semi-supervised learning tasks. Traditional clustering algorithms often suffer from inefficient memory utilization during computation, necessitating the loading of all data samples into memory, which becomes impractical for large-scale datasets. While variants such as Mini-Batch K-Means partially mitigate this issue by reducing memory usage, they lack robust theoretical convergence guarantees due to the non-convex nature of clustering problems. In contrast, the Stochastic Quantization algorithm provides strong theoretical convergence guarantees, making it a robust alternative for clustering tasks. We demonstrate the computational efficiency and rapid convergence of the algorithm on an image classification problem with partially labeled data, comparing model accuracy across various ratios of labeled to unlabeled data. To address the challenge of high dimensionality, we employ a Triplet Network to encode images into low-dimensional representations in a latent space, which serve as a basis for comparing the efficiency of both the Stochastic Quantization algorithm and traditional quantization algorithms. Furthermore, we enhance the algorithm's convergence speed by introducing modifications with an adaptive learning rate.

Paper Structure

This paper contains 9 sections, 2 theorems, 37 equations, 5 figures, 1 table, 1 algorithm.

Key Result

Lemma

At the global optimum $y^{*} = (y_1^{*}, \ldots, y_K^{*})$ of the constrained Stochastic Quantization problem (eq. sq-objective-constraints), every $y_k^{*}$, $k = 1, \ldots, K$, belongs to the convex hull of the elements $\{\xi_1, \ldots, \xi_I\}$ of the feature set.
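For intuition only, and not as the paper's proof, the statement can be checked by hand in a special case. The sketch below assumes Euclidean distance with $r = 2$, an empirical distribution with weights $p_i > 0$, and non-empty index sets $I_k$ of samples whose nearest center is $y_k^{*}$; the symbols $p_i$ and $I_k$ are introduced here for illustration and do not come from the source.

```latex
% Assumptions (not from the source): d(\xi, y) = \|\xi - y\|_2, r = 2,
% weights p_i > 0 with \sum_i p_i = 1, and every cell I_k is non-empty.
\[
F(y^{*}) = \sum_{k=1}^{K} \sum_{i \in I_k} p_i \,\|\xi_i - y_k^{*}\|_2^{2},
\qquad
2 \sum_{i \in I_k} p_i \,(y_k^{*} - \xi_i) = 0
\;\Longrightarrow\;
y_k^{*} = \frac{\sum_{i \in I_k} p_i \,\xi_i}{\sum_{i \in I_k} p_i}.
\]
```

With the assignments held fixed, each $y_k^{*}$ must minimize its own inner sum, otherwise the objective could be lowered further. The resulting weighted mean has non-negative coefficients that sum to one, so it is a convex combination of the $\xi_i$ and therefore lies in their convex hull.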

Figures (5)

  • Figure 1: Triplet Network structure for the MNIST dataset (LeCun et al., 2010).
  • Figure 2: Representative samples from the MNIST dataset (LeCun et al., 2010) with their corresponding labels.
  • Figure 3: Latent representations of images in the train dataset (left) and test dataset (right) projected by the Triplet Network, with each element color-coded according to its label (0-9). The clustering of elements with the same label suggests that the Triplet Network successfully captured relevant features during training.
  • Figure 4: Optimal positions of quants for each Stochastic Quantization variant (labeled by variant name) in the latent space, relative to labeled elements (0-9) from the test dataset.
  • Figure 5: Convergence speed comparison of Stochastic Quantization variants on latent representations of training data with 100% labeled fraction.

Theorems & Definitions (5)

  • Definition
  • Lemma
  • Proof
  • Theorem
  • Definition