New Approach to Clustering Random Attributes

Zenon Gniazdowski

TL;DR

The paper tackles clustering when attributes come from mixed data types by numerically encoding nominal attributes and applying exploratory factor analysis to the resulting numeric representation. After deriving a factor model and rotating with Varimax, it assigns attributes to similarity classes to form clusters via absolute or relative majority variance rules, enabling simultaneous clustering of numeric and encoded nominal attributes. Demonstrations on Weather Forecast, Mushroom, Automobile, and Breast Cancer datasets show the method’s ability to reveal meaningful attribute-factor groupings, though per-attribute variance reproduction and distributional assumptions influence results. The approach provides a principled, geometry-based framework for cross-type attribute clustering with potential applications in surveys and heterogeneous data analysis, while acknowledging limitations related to normality and encoding choices.

Abstract

This paper proposes a new method for similarity analysis and, consequently, a new algorithm for clustering different types of random attributes, both numerical and nominal. For nominal attributes to be clustered, their values must first be properly encoded; encoding gives them a new numerical representation. Only numeric attributes can be subjected to factor analysis, which allows them to be clustered according to their similarity to factors. The proposed method was tested on several sample datasets and was found to be universal: it allows clustering of numerical attributes, clustering of nominal attributes, and simultaneous clustering of numerical attributes and numerically encoded nominal attributes.
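
The assignment step mentioned in the TL;DR (attributes placed into clusters by an absolute or relative majority of variance) can be illustrated with a minimal sketch. It is not the author's implementation. It assumes the attributes are standardized, so a squared factor loading is the share of an attribute's variance reproduced by that factor. It also assumes "absolute majority" means that one factor reproduces more than half of that variance, and "relative majority" means the factor with the largest share; the exact definitions are not given in this summary. The function name `assign_to_factors` and the loadings matrix are hypothetical and made up for illustration.

```python
import numpy as np

def assign_to_factors(loadings, rule="relative"):
    """Assign each attribute (row) to one factor (column).

    Assumption: squared loadings are variance shares of standardized
    attributes. Returns factor indices; -1 marks an unassigned attribute.
    """
    shares = np.asarray(loadings, dtype=float) ** 2
    best = shares.argmax(axis=1)  # factor with the largest variance share
    if rule == "relative":
        # Relative majority (assumed): the largest share wins, however small.
        return best
    if rule == "absolute":
        # Absolute majority (assumed): the winning share must exceed one half.
        return np.where(shares.max(axis=1) > 0.5, best, -1)
    raise ValueError("rule must be 'relative' or 'absolute'")

# Hypothetical rotated loadings: 4 attributes, 2 factors.
loadings = np.array([
    [0.90, 0.10],
    [0.80, -0.30],
    [0.20, 0.85],
    [0.55, 0.50],
])

relative = assign_to_factors(loadings, "relative")
absolute = assign_to_factors(loadings, "absolute")
print(relative.tolist())  # [0, 0, 1, 0]
print(absolute.tolist())  # [0, 0, 1, -1]
```

In this made-up example the last attribute joins the first factor under the relative rule but stays unassigned under the absolute rule, because no single factor reproduces more than half of its variance.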

Paper Structure

This paper contains 39 sections, 15 equations, 15 figures, 25 tables, 3 algorithms.

Figures (15)

  • Figure 1: Scree plot for Simple Weather Forecast dataset
  • Figure 2: Simple Weather Forecast dataset -- minimum variance (MinVar) and average variance (AverVar) of attributes, reconstructed by successive factors, shown against a normalized scree plot
  • Figure 3: Absolute similarity of attributes to factors for encoded Simple Weather Forecast dataset
  • Figure 4: Relative similarity of attributes to factors for encoded Simple Weather Forecast dataset
  • Figure 5: Mushroom dataset -- variances represented by several successive factors: ScreePlt, MinVar and AverVar
  • ...and 10 more figures