Learning Order Forest for Qualitative-Attribute Data Clustering

Mingjie Zhao, Sen Feng, Yiqun Zhang, Mengke Li, Yang Lu, Yiu-ming Cheung

TL;DR

A tree-like distance structure is discovered that flexibly represents the local order relationship among intra-attribute qualitative values and captures rich order relationships between the vertex value and the others.

Abstract

Clustering is a fundamental approach to understanding data patterns, and the intuitive Euclidean distance space is commonly adopted for it. However, this space does not suit the implicit cluster distributions reflected by qualitative attribute values, e.g., the nominal values of attributes such as symptoms and marital status. This paper therefore discovers a tree-like distance structure that flexibly represents the local order relationship among intra-attribute qualitative values. That is, treating a value as the vertex of the tree allows rich order relationships between the vertex value and the others to be captured. To obtain the trees in a clustering-friendly form, a joint learning mechanism is proposed to iteratively obtain more appropriate tree structures and clusters. It turns out that the latent distance space of the whole dataset can be well represented by a forest consisting of the learned trees. Extensive experiments demonstrate that the joint learning adapts the forest to the clustering task to yield accurate results. Comparisons with 10 counterparts on 12 real benchmark datasets, with significance tests, verify the superiority of the proposed method.

Paper Structure

This paper contains 14 sections, 3 theorems, 9 equations, 6 figures, 4 tables, 1 algorithm.

Key Result

Theorem 1

The trace distance measure $d_{r,u,s}$ defined in the context of the order tree $M_r$ represents a valid distance metric.
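
For reference, "valid distance metric" conventionally means the four axioms below. The symbols $d$, $a$, $b$, $c$ are generic placeholders rather than the paper's notation; how they map onto the indices of $d_{r,u,s}$ is not stated in this excerpt, so the block should be read as the standard definition only, not as the paper's proof.

```latex
\begin{align*}
d(a,b) &\ge 0 && \text{(non-negativity)}\\
d(a,b) &= 0 \iff a = b && \text{(identity of indiscernibles)}\\
d(a,b) &= d(b,a) && \text{(symmetry)}\\
d(a,c) &\le d(a,b) + d(b,c) && \text{(triangle inequality)}
\end{align*}
```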

Figures (6)

  • Figure 1: An intuitive comparison of clustering performance under different types of distance structures. (a) and (b) show a typical line graph and a typical fully connected graph. (c) shows the $k$-modes clustering performance with the following distance structures: 1) Randomly Generated Graphs (RGGs, not necessarily fully connected, but all attribute values are guaranteed to be connected), 2) Fully Connected Graphs (FCGs), 3) Randomly Generated Line Graphs (RGLGs), and 4) Semantic Line Graphs (SLGs, which arrange the possible values in the graph according to their semantic order). The RGGs and RGLGs, which involve randomization, are run 50 times, and the clustering accuracy is sorted for better visualization.
  • Figure 2: Process of order tree construction. (a) A fully connected graph $\mathcal{G}_r$ is prepared with a distance matrix reflecting the edge weights. (b) Prim's or Kruskal's algorithm is applied to generate an order tree with a unique order trace between each pair of nodes, as defined in Definition 1.
  • Figure 3: CA performance of different ablated COForest versions.
  • Figure 4: Convergence curves of COForest on different datasets. $L$ represents the value of the objective function.
  • Figure 5: Execution time on synthetic datasets.
  • ...and 1 more figures
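
The construction described in the Figure 2 caption (a fully connected graph with a distance matrix, reduced to a tree by Prim's algorithm) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the distance matrix `dist` is made-up data, the function names `prim_mst` and `trace_distances` are hypothetical, and treating the distance along a trace as the sum of edge weights on the unique tree path is an assumption, since Definition 1 is not reproduced in this excerpt.

```python
from collections import deque


def prim_mst(dist):
    """Prim's algorithm on a complete graph given as a symmetric matrix.

    Returns the spanning tree as an adjacency list {node: [(neighbor, weight)]}.
    """
    n = len(dist)
    in_tree = [False] * n
    best = [float("inf")] * n  # cheapest known edge linking each node to the tree
    parent = [-1] * n
    best[0] = 0.0
    adj = {i: [] for i in range(n)}
    for _ in range(n):
        # Pick the node outside the tree with the cheapest connecting edge.
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        if parent[u] >= 0:
            w = dist[u][parent[u]]
            adj[u].append((parent[u], w))
            adj[parent[u]].append((u, w))
        # Relax the connecting cost of the remaining nodes through u.
        for v in range(n):
            if not in_tree[v] and dist[u][v] < best[v]:
                best[v] = dist[u][v]
                parent[v] = u
    return adj


def trace_distances(adj, root):
    """Sum edge weights along the unique tree path from root to every node
    (assumed reading of the trace distance; see the lead-in)."""
    d = {root: 0.0}
    queue = deque([root])
    while queue:
        u = queue.popleft()
        for v, w in adj[u]:
            if v not in d:
                d[v] = d[u] + w
                queue.append(v)
    return d


# Illustrative pairwise distances among four values of one attribute.
dist = [
    [0, 1, 4, 5],
    [1, 0, 2, 6],
    [4, 2, 0, 3],
    [5, 6, 3, 0],
]
adj = prim_mst(dist)
from_value_0 = trace_distances(adj, 0)
```

For this toy matrix the tree is the chain 0-1-2-3, so each pair of values is joined by exactly one path. Calling `trace_distances` with a different `root` gives the distances as seen from another vertex value, which is one way to read "treating a value as the vertex of the tree".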

Theorems & Definitions (7)

  • Remark 1: Generalization of relationship graph
  • Definition 1: Order trace
  • Theorem 1
  • proof
  • Theorem 2
  • proof
  • Theorem 3