Learning-Augmented K-Means Clustering Using Dimensional Reduction

Issam K. O Jabari, Shofiyah, Pradiptya Kahvi S, Novi Nur Putriwijaya, Novanto Yudistira

TL;DR

This work tackles improving clustering quality for high-dimensional data by integrating a learning-augmented predictor within the k-means framework and applying PCA-based dimensionality reduction. It introduces PredictorClustering, which uses PCA to project data into a lower-dimensional space, uses a predictor to initialize cluster assignments, and iteratively refines clusters. Empirical results on the Oregon Graphs, PHY, and CIFAR-10 datasets show that PCA before clustering lowers the cost and improves robustness to predictor noise, especially for larger values of $k$ such as $k=10$ and $k=25$, relative to standard k-means. Overall, the paper demonstrates that dimensionality reduction can alleviate local minima and computational costs in learning-augmented clustering, offering practical guidance for high-dimensional clustering tasks.
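
The pipeline summarized above (project with PCA, start from predicted cluster assignments, then refine) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the synthetic data, the 20% label-noise rate, and the use of plain Lloyd iterations for the refinement step are all assumptions made for illustration, and the noisy predictor is simulated by randomly corrupting ground-truth labels.

```python
import numpy as np

def pca_project(X, n_components):
    """Project rows of X onto the top principal components (computed via SVD)."""
    Xc = X - X.mean(axis=0)                      # center the data
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T              # coordinates in the reduced space

def kmeans_cost(X, centers, labels):
    """Sum of squared distances from each point to its assigned center."""
    return float(((X - centers[labels]) ** 2).sum())

def cluster_means(Z, labels, k, rng):
    """Mean of each cluster; an empty cluster falls back to a random data point."""
    return np.stack([
        Z[labels == j].mean(axis=0) if np.any(labels == j)
        else Z[rng.integers(len(Z))]
        for j in range(k)
    ])

def predictor_clustering(X, predicted_labels, k, n_components, n_iter=20, seed=0):
    """Hypothetical sketch: PCA, then Lloyd refinement starting from predicted labels."""
    rng = np.random.default_rng(seed)
    Z = pca_project(X, n_components)
    labels = np.asarray(predicted_labels).copy()
    for _ in range(n_iter):
        centers = cluster_means(Z, labels, k, rng)
        # squared distance from every point to every center
        d2 = ((Z[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        new_labels = d2.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break                                # assignments are stable
        labels = new_labels
    centers = cluster_means(Z, labels, k, rng)   # centers matching the final labels
    return labels, centers, kmeans_cost(Z, centers, labels)

# Synthetic example (assumed data): three Gaussian blobs in 20 dimensions.
rng = np.random.default_rng(0)
true_labels = np.repeat(np.arange(3), 100)
blob_means = rng.normal(scale=5.0, size=(3, 20))
X = blob_means[true_labels] + rng.normal(size=(300, 20))

# Simulated noisy predictor: about 20% of labels replaced by random ones.
noisy = true_labels.copy()
flip = rng.random(300) < 0.2
noisy[flip] = rng.integers(0, 3, size=flip.sum())

labels, centers, cost = predictor_clustering(X, noisy, k=3, n_components=2)
print(cost)
```

Because each Lloyd step can only lower the k-means cost (as long as no cluster becomes empty), the refined cost in the projected space is no higher than the cost of the raw predicted assignments.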

Abstract

Learning augmentation is a machine learning concept intended to improve the performance of a method or model, for example by enhancing its ability to predict and generalize from data or features, or by testing its reliability under noise and other factors. Clustering, meanwhile, is a fundamental aspect of data analysis and has long been used to understand the structure of large datasets. Despite its long history, the k-means algorithm still faces challenges. One approach, suggested by Ergun et al., is to use a predictor to minimize the sum of squared distances between each data point and a specified centroid. However, the computational cost of this algorithm is known to increase with the value of k, and it often gets stuck in local minima. In response to these challenges, we propose reducing the dimensionality of the dataset using Principal Component Analysis (PCA). Notably, for k values of 10 and 25, the proposed algorithm yields lower cost than running it without PCA. "Principal component analysis (PCA) is the problem of fitting a low-dimensional affine subspace to a set of data points in a high-dimensional space. PCA is well-established in the literature and has become one of the most useful tools for data modeling, compression, and visualization."
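
For reference, the two quantities the abstract relies on can be written in standard form. The notation here ($P$ for the point set, $c_1, \dots, c_k$ for the centroids, $\mu$ for the data mean, $V_d$ for the matrix of the top $d$ principal directions) is introduced for illustration and is not taken from the paper.

```latex
% k-means cost: sum of squared distances to the nearest centroid
\operatorname{cost}(P, \{c_1, \dots, c_k\})
  = \sum_{p \in P} \min_{1 \le j \le k} \lVert p - c_j \rVert_2^2

% PCA projection of a point p onto the top-d principal directions
\tilde{p} = V_d^{\top} (p - \mu), \qquad V_d^{\top} V_d = I_d
```

Clustering after PCA then means evaluating the same cost on the projected points $\tilde{p}$ instead of the original points $p$.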

Paper Structure (12 sections, 12 figures, 1 table)

This paper contains 12 sections, 12 figures, 1 table.

Figures (12)

  • Figure 1: Plot of Oregon Dataset Dimension without Using PCA
  • Figure 2: Plot of Oregon Dataset Dimension Using PCA
  • Figure 3: Plot of PHY Dataset Dimension without Using PCA
  • Figure 4: Plot of PHY Dataset Dimension Using PCA
  • Figure 5: Plot of CIFAR10 Dataset Dimension without Using PCA
  • ...and 7 more figures