An Approach to Variable Clustering: K-means in Transposed Data and its Relationship with Principal Component Analysis

Victor Saquicela, Kenneth Palacio-Baus, Mario Chifla

TL;DR

The paper introduces a novel approach that clusters variables by applying K-means to the transposed data and relates these variable clusters to PCA directions. It standardizes data, performs PCA on the original matrix, transposes the data, clusters variables, and then quantifies each cluster's contribution to each principal component using sums of absolute loadings, producing S and P matrices that link variable groups to variance directions. Through experiments on datasets like USArrests, Iris, and Decathlon, the method reveals how variable clusters influence principal components, offering a complementary perspective to traditional PCA and observation-based clustering. The work provides a practical exploratory tool for high-dimensional data analysis and highlights avenues for further validation, methodological comparisons, and visualization enhancements.

Abstract

Principal Component Analysis (PCA) and K-means constitute fundamental techniques in multivariate analysis. Although they are frequently applied independently or sequentially to cluster observations, the relationship between them, especially when K-means is used to cluster variables rather than observations, has been scarcely explored. This study seeks to address this gap by proposing an innovative method that analyzes the relationship between clusters of variables obtained by applying K-means on transposed data and the principal components of PCA. Our approach involves applying PCA to the original data and K-means to the transposed data set, where the original variables are converted into observations. The contribution of each variable cluster to each principal component is then quantified using measures based on variable loadings. This process provides a tool to explore and understand the clustering of variables and how such clusters contribute to the principal dimensions of variation identified by PCA.
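
The pipeline described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation. It uses NumPy only, with PCA computed through an SVD and a plain Lloyd's-algorithm K-means. The function names (`standardize`, `pca_loadings`, `kmeans`, `cluster_contributions`) are hypothetical. S is taken to be the per-cluster sum of absolute loadings on each component, as the summary states. The exact definition of P is not given in this excerpt, so it is assumed here to be S expressed as a percentage of each component's total.

```python
import numpy as np

def standardize(X):
    """Center each column and scale it to unit (sample) variance."""
    return (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

def pca_loadings(Z):
    """Return loadings (variables x components) from the SVD of Z."""
    _, _, Vt = np.linalg.svd(Z, full_matrices=False)
    return Vt.T

def kmeans(points, k, n_iter=100, seed=0):
    """Plain Lloyd's algorithm; returns one cluster label per row of points."""
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Distance from every point to every center, then nearest center.
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute centers; keep the old center if a cluster went empty.
        new_centers = np.array([
            points[labels == c].mean(axis=0) if np.any(labels == c) else centers[c]
            for c in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels

def cluster_contributions(X, k, seed=0):
    """Cluster variables and summarize their loadings per component."""
    Z = standardize(X)
    loadings = pca_loadings(Z)           # PCA on the original (standardized) data
    labels = kmeans(Z.T, k, seed=seed)   # K-means on the transpose: rows are variables
    S = np.zeros((k, loadings.shape[1]))
    for c in range(k):
        # Sum of absolute loadings of the variables in cluster c.
        S[c] = np.abs(loadings[labels == c]).sum(axis=0)
    # ASSUMPTION: P is S as a percentage of each component's column total.
    P = 100 * S / S.sum(axis=0, keepdims=True)
    return labels, S, P

# Usage on synthetic data (not one of the paper's datasets).
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(50, 4))
labels_demo, S_demo, P_demo = cluster_contributions(X_demo, k=2)
print(labels_demo)
print(np.round(P_demo, 1))
```

Each row of `S` and `P` corresponds to a variable cluster and each column to a principal component, so a large entry marks a cluster whose variables load heavily on that component. Absolute values are used so that loadings of opposite sign within a cluster do not cancel.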

Paper Structure

This paper contains 14 sections, 4 equations, 2 figures, 6 tables, 1 algorithm.

Figures (2)

  • Figure 1: PCA + Kmeans process
  • Figure 2: Results of the USArrests dataset