Handling LP-Rounding for Hierarchical Clustering and Fitting Distances by Ultrametrics
Hyung-Chan An, Mong-Jen Kao, Changyeol Lee, Mu-Ting Lee
TL;DR
The paper studies hierarchical correlation clustering on a complete graph with $\ell$ layers, a problem closely tied to fitting distances by ultrametrics (numerical taxonomy). It introduces a simple LP-rounding paradigm that achieves a $25.7846$-approximation. The paradigm exploits a key LP property that bounds the contribution to the objective of non-edge pairs at distance $<1$, and it decomposes the rounding into a pre-clustering step that forms sets of diameter $<\frac{1}{3}$, followed by a bottom-up set-merging procedure guided by the layer structure. The approach admits a clean interpretation as finding cuts with prescribed average distances and, as a corollary, gives a $5$-approximation for the ultrametric violation distance problem, a simple alternative to the state-of-the-art result. The framework opens avenues for improving hierarchical clustering objectives and extending to other ultrametric-fitting variants.
Abstract
We consider the classic correlation clustering problem in the hierarchical setting. Given a complete graph $G=(V,E)$ and $\ell$ layers of input information, where the input of each layer consists of a nonnegative weight and a labeling of the edges with either + or -, this problem seeks to compute for each layer a partition of $V$ such that the partition for any non-top layer subdivides the partition in the upper layer and the weighted number of disagreements over the layers is minimized. Hierarchical correlation clustering is a natural formulation of the classic problem of fitting distances by ultrametrics, which is also known as numerical taxonomy in the literature. While single-layer correlation clustering has received wide attention since it was introduced and major progress has been made in the past three years, little is known for this problem in the hierarchical setting. The lack of understanding and adequate tools is reflected in the large approximation ratio known for this problem, originating from 2021. In this work, we make both conceptual and technical contributions towards the hierarchical clustering problem. We present a simple paradigm that greatly facilitates LP-rounding in hierarchical clustering, illustrated with an algorithm providing a significantly improved approximation guarantee of 25.7846 for the hierarchical correlation clustering problem. Our techniques reveal surprising new properties of the formulation presented and subsequently used in previous works for hierarchical clustering over the past two decades. This provides an interpretation of the core problem in hierarchical clustering as the problem of finding cuts with prescribed properties regarding average distances. We further illustrate this perspective by showing that a direct application of the techniques gives a simple alternative to the state-of-the-art result for the ultrametric violation distance problem.
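To make the objective stated in the abstract concrete, the following is a minimal Python sketch that evaluates the weighted number of disagreements for a given hierarchy of partitions and checks the nesting constraint. It is only an illustration of the problem definition, not the paper's LP-rounding algorithm. The function names (`is_refinement`, `disagreements`, `hierarchical_cost`), the data layout (dictionaries mapping vertices to cluster ids and vertex pairs to `"+"` or `"-"`), and the convention that layers are listed from bottom (finest) to top (coarsest) are all assumptions made for this example.

```python
def is_refinement(lower, upper):
    """Return True if every cluster of `lower` lies inside one cluster of `upper`."""
    seen = {}  # cluster id in `lower` -> cluster id in `upper`
    for v, c in lower.items():
        if seen.setdefault(c, upper[v]) != upper[v]:
            return False
    return True


def disagreements(labels, partition):
    """Count pairs whose label conflicts with the partition.

    A "+" pair split across clusters, or a "-" pair placed in the same
    cluster, counts as one disagreement.
    """
    count = 0
    for (u, v), sign in labels.items():
        same_cluster = partition[u] == partition[v]
        if (sign == "+") != same_cluster:
            count += 1
    return count


def hierarchical_cost(weights, labels_per_layer, partitions):
    """Weighted disagreements summed over layers (assumed order: bottom to top)."""
    for lower, upper in zip(partitions, partitions[1:]):
        if not is_refinement(lower, upper):
            raise ValueError("each layer must subdivide the layer above it")
    return sum(
        w * disagreements(labels, part)
        for w, labels, part in zip(weights, labels_per_layer, partitions)
    )


# Toy instance (hypothetical data): three vertices, two layers.
weights = [1.0, 2.0]
labels_per_layer = [
    {("a", "b"): "+", ("a", "c"): "-", ("b", "c"): "-"},  # bottom layer
    {("a", "b"): "+", ("a", "c"): "+", ("b", "c"): "-"},  # top layer
]
partitions = [
    {"a": 0, "b": 0, "c": 1},  # bottom: {a, b}, {c}
    {"a": 0, "b": 0, "c": 0},  # top: {a, b, c}
]
cost = hierarchical_cost(weights, labels_per_layer, partitions)
```

In this toy instance the bottom layer agrees with all of its labels, while the top layer places the "-" pair (b, c) in one cluster, so the only contribution to the cost is the top layer's weight times one disagreement.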
