An objective function for order preserving hierarchical clustering

Daniel Bakkelund

TL;DR

This work reframes hierarchical clustering to preserve order within probabilistic partial orders and DAGs by augmenting Dasgupta’s similarity-based objective with an order-preservation component. It introduces a relaxed ordered framework using $\omega$, antisymmetrisation $g$, and a combined objective $f=s_d+g$, yielding a bi-objective optimisation whose extremes recover pure clustering or pure ordering. The paper proves that optimal trees under special cases are order-preserving, analyzes performance under planted partial orders, and demonstrates a polynomial-time approximation with a guarantee of $O(\log^{3/2} n)$ via a directed sparsest-cut approach. A demonstration on a machine-parts dataset shows advantages over existing order-preserving methods, with improved clustering quality while maintaining order constraints. The work contributes a formal definition of order-preserving hierarchical clustering, a concrete objective combining similarity and order, and an approximation algorithm with practical validation and guidance for future theory and implementations.
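For readers unfamiliar with the similarity-based objective that the summary builds on, the following is a minimal sketch of the standard Dasgupta cost of a binary tree: each pair of leaves contributes its similarity multiplied by the number of leaves under the pair's lowest common ancestor. It does not attempt to reproduce the combined objective $f=s_d+g$, whose definitions are not given in this summary. The nested-tuple tree encoding, the `frozenset`-keyed similarity dictionary, and the function name `dasgupta_cost` are assumptions made for illustration.

```python
def dasgupta_cost(tree, sim):
    """Standard Dasgupta cost of a binary tree (illustrative sketch).

    tree: a leaf label, or a pair (left_subtree, right_subtree).
    sim:  dict mapping frozenset({a, b}) to a similarity; missing pairs count as 0.
    """
    def walk(node):
        # Returns (leaves under node, accumulated cost of the subtree).
        if not isinstance(node, tuple):
            return [node], 0.0
        left_leaves, left_cost = walk(node[0])
        right_leaves, right_cost = walk(node[1])
        # Pairs separated at this node have this node as lowest common ancestor.
        cut = sum(sim.get(frozenset((a, b)), 0.0)
                  for a in left_leaves for b in right_leaves)
        size = len(left_leaves) + len(right_leaves)
        return left_leaves + right_leaves, left_cost + right_cost + size * cut

    return walk(tree)[1]


# Hypothetical toy data: items 0 and 1 are similar, item 2 is not.
sim = {frozenset((0, 1)): 1.0, frozenset((1, 2)): 0.1, frozenset((0, 2)): 0.1}
cost_grouped = dasgupta_cost(((0, 1), 2), sim)   # similar pair merged low in the tree
cost_split = dasgupta_cost((0, (1, 2)), sim)     # similar pair separated at the root
```

Under this cost, lower is better: the tree that keeps the similar pair together deep in the hierarchy is cheaper than the one that separates it at the root.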

Abstract

We present a theory and an objective function for similarity-based hierarchical clustering of probabilistic partial orders and directed acyclic graphs (DAGs). Specifically, given elements $x \le y$ in the partial order, and their respective clusters $[x]$ and $[y]$, the theory yields an order relation $\le'$ on the clusters such that $[x]\le'[y]$. The theory provides a concise definition of order-preserving hierarchical clustering, and offers a classification theorem identifying the order-preserving trees (dendrograms). To determine the optimal order-preserving trees, we develop an objective function that frames the problem as a bi-objective optimisation, aiming to satisfy both the order relation and the similarity measure. We prove that the optimal trees under the objective are both order-preserving and exhibit high-quality hierarchical clustering. Since finding an optimal solution is NP-hard, we introduce a polynomial-time approximation algorithm and demonstrate that the method outperforms existing methods for order-preserving hierarchical clustering by a significant margin.
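The requirement that $x \le y$ implies $[x]\le'[y]$ can be illustrated with a small check. The sketch below assumes that a flat clustering is order-preserving exactly when the relation induced on distinct clusters has no cycles, so that it extends to a partial order on the clusters; this matches the Figure 1 caption, where a cycle among clusters is described as a contradiction, but it is an assumption rather than the paper's formal definition. The function names and the input encoding are hypothetical.

```python
def induced_cluster_relation(order_pairs, clustering):
    """Pairs (i, j), i != j, such that some x in cluster i and y in cluster j have x <= y."""
    cluster_of = {x: i for i, block in enumerate(clustering) for x in block}
    return {(cluster_of[x], cluster_of[y])
            for x, y in order_pairs
            if cluster_of[x] != cluster_of[y]}


def is_acyclic(n, edges):
    """Kahn's algorithm: True if the directed graph on nodes 0..n-1 has no cycle."""
    indegree = [0] * n
    successors = [[] for _ in range(n)]
    for a, b in edges:
        successors[a].append(b)
        indegree[b] += 1
    ready = [i for i in range(n) if indegree[i] == 0]
    visited = 0
    while ready:
        a = ready.pop()
        visited += 1
        for b in successors[a]:
            indegree[b] -= 1
            if indegree[b] == 0:
                ready.append(b)
    return visited == n


def preserves_order(order_pairs, clustering):
    """Assumed criterion: the induced relation between clusters is cycle-free."""
    edges = induced_cluster_relation(order_pairs, clustering)
    return is_acyclic(len(clustering), edges)


# Hypothetical example: the chain a <= b <= c <= d, given by its covering pairs.
chain = [("a", "b"), ("b", "c"), ("c", "d")]
ok = preserves_order(chain, [{"a", "b"}, {"c", "d"}])      # clusters ordered 0 -> 1
bad = preserves_order(chain, [{"a", "c"}, {"b"}, {"d"}])   # clusters 0 and 1 form a cycle
```

In the second clustering, $a \le b$ places cluster $\{a,c\}$ below $\{b\}$ while $b \le c$ places it above, so no order on the clusters is consistent with the original one.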

Paper Structure

This paper contains 36 sections, 20 theorems, 63 equations, 9 figures, 4 tables.

Key Result

Theorem 3

Let $(X,\le)$ be an ordered set, and let $\mathcal{C}$ be a clustering of $X$. Then the following two statements are equivalent:

Figures (9)

  • Figure 1: A selection of possible ordered clusterings of the set $\{a,b,c,d\}$. Possible interpretations in terms of the motivating use case are given together with the clusterings. All but $6)$ are examples of order preserving clusterings. In $6)$, the part-of relations constitute a cycle, implying that the parts are proper sub-parts of themselves, which is a contradiction.
  • Figure 2: A hierarchical clustering of the set $X=\{1,\ldots,5\}$.
  • Figure 3: Result of clustering the migration data. The binary tree is displayed on the left, and the sequence of splits of the states is shown on the right. The arrows on the splits indicate the direction of net migration.
  • Figure 4: John F. Kennedy's ancestral tree, two generations back.
  • Figure 5: The optimal hierarchical clustering of the Kennedy family tree. We have left out the final splits into leaf nodes.
  • ...and 4 more figures

Theorems & Definitions (51)

  • Definition 1
  • Definition 2
  • Theorem 3: Blyth2005
  • Theorem 4
  • Definition 5
  • proof: Proof of Theorem \ref{thm:op-hc}
  • Definition 6
  • Lemma 7
  • proof
  • Lemma 8
  • ...and 41 more