Learning Unified Distance Metric for Heterogeneous Attribute Data Clustering

Yiqun Zhang, Mingjie Zhao, Yizhou Chen, Yang Lu, Yiu-ming Cheung

TL;DR

This paper proposes a novel Heterogeneous Attribute Reconstruction and Representation (HARR) learning paradigm for cluster analysis. HARR transforms heterogeneous attributes into a homogeneous status for distance metric learning and integrates that learning with clustering, so that the metric adapts automatically to different clustering tasks.

Abstract

Datasets composed of numerical and categorical attributes (hereinafter called mixed data) are common in real clustering tasks. Unlike numerical attributes, whose values indicate tendencies between two concepts (e.g., high and low temperature) in a well-defined Euclidean distance space, categorical attribute values are distinct concepts (e.g., different occupations) embedded in an implicit space. Exploiting these two very different types of information simultaneously is an unavoidable but challenging problem. Most advanced approaches to mixed data clustering either encode the heterogeneous numerical and categorical attributes into one type or define a unified metric for them, leaving their inherent connection unrevealed. This paper therefore studies the connection among attributes of any type and proposes a novel Heterogeneous Attribute Reconstruction and Representation (HARR) learning paradigm for cluster analysis. The paradigm transforms heterogeneous attributes into a homogeneous status for distance metric learning, and integrates the learning with clustering to automatically adapt the metric to different clustering tasks. Unlike most existing works, which directly adopt predefined distance metrics or learn attribute weights to search for clusters in a subspace, we propose to project the values of each attribute into multiple unified learnable spaces to represent and learn the distance metric for categorical data more finely. HARR is parameter-free, convergence-guaranteed, and can more effectively self-adapt to different sought numbers of clusters $k$. Extensive experiments illustrate its superiority in terms of accuracy and efficiency.

Paper Structure (21 sections, 7 theorems, 22 equations, 11 figures, 8 tables, 2 algorithms)

Key Result

Theorem 1

The distance measure yielded by the proposed projection-based representation is a distance metric.
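
For reference, the statement that a measure "is a distance metric" conventionally means that it satisfies the four axioms below. The symbols $d$, $x$, $y$, and $z$ are generic notation assumed here for illustration and are not taken from the paper.

```latex
\begin{align*}
d(x, y) &\ge 0                  && \text{(non-negativity)} \\
d(x, y) &= 0 \iff x = y         && \text{(identity of indiscernibles)} \\
d(x, y) &= d(y, x)              && \text{(symmetry)} \\
d(x, z) &\le d(x, y) + d(y, z)  && \text{(triangle inequality)}
\end{align*}
```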

Figures (11)

  • Figure 1: Conventional taxonomy of attributes, which shows the traditional classification of attributes into Numerical and Categorical types. Categorical Attributes are further divided into Ordinal Attributes, with inherent order (e.g., small, medium, large), and Nominal Attributes, without natural order (e.g., colors or labels). This taxonomy emphasizes the distinct characteristics of each attribute type.
  • Figure 2: Comparison of the "concepts" of numerical attribute, nominal attribute, and ordinal attribute, which illustrates how "concepts" and "values" are represented across different attribute types. Numerical attributes, like "income," maintain a continuous order (e.g., high to low). Nominal attributes, such as "occupation," consist of distinct categories without inherent order (e.g., nurse, lawyer, driver). Ordinal attributes, like "recommendation," contain ordered categories (e.g., accept, weak accept, marginal, reject), highlighting their hierarchical structure.
  • Figure 3: Relationships among different types of attributes in base distance computation. "A$\rightarrow$B" means that the base distances of Type-B attributes are computed with the contribution of Type-A attributes.
  • Figure 4: Projection processes of a categorical attribute $a^r$. An attribute $a^r$ with $4$ possible values is expanded into $6$ distinct attributes after the projection process. Each projected attribute represents unique information, highlighting the rich information content of the original attribute.
  • Figure 5: Comparison of Hamming distance, one-hot encoding, and the categorical attribute representation obtained through our method. Our representation is more informative and provides a homogeneous basis for distance computation on heterogeneous attributes.
  • ...and 6 more figures
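
The two baselines named in the Figure 5 caption can be illustrated with a minimal sketch. It shows that Hamming distance and Euclidean distance over one-hot codes each assign a single constant distance to every pair of distinct nominal values, so neither distinguishes one pair from another. It also counts the unordered value pairs of a 4-valued attribute, which matches the 6 projected attributes mentioned in the Figure 4 caption. Reading the 6 as one projection per value pair is an assumption, not something the captions state. The value "teacher" and all function names are hypothetical, and this is not the HARR representation itself.

```python
from itertools import combinations
import math

# Hypothetical nominal attribute: three values from the Figure 2 example
# plus "teacher", which is added here only to reach 4 values.
values = ["nurse", "lawyer", "driver", "teacher"]


def hamming(a, b):
    """Hamming distance between two categorical values: 0 if equal, else 1."""
    return 0 if a == b else 1


def one_hot(value, vocabulary):
    """One-hot code of `value` with respect to `vocabulary`."""
    return [1.0 if value == v else 0.0 for v in vocabulary]


def euclidean(x, y):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))


# All unordered pairs of distinct values: C(4, 2) pairs.
pairs = list(combinations(values, 2))

# Sets of distinct distances observed across all pairs.
hamming_dists = {hamming(a, b) for a, b in pairs}
one_hot_dists = {
    round(euclidean(one_hot(a, values), one_hot(b, values)), 6)
    for a, b in pairs
}

print(len(pairs))      # number of value pairs
print(hamming_dists)   # distinct Hamming distances
print(one_hot_dists)   # distinct one-hot Euclidean distances
```

Both sets contain a single element, which is the sense in which these baselines carry no pair-specific information for a nominal attribute.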

Theorems & Definitions (11)

  • Theorem 1
  • proof
  • Lemma 1
  • Lemma 2
  • Theorem 2
  • proof
  • Lemma 3
  • Theorem 3
  • proof
  • Theorem 4
  • ...and 1 more