Decomposing the Jaccard Distance and the Jaccard Index in ABCDE

Stephan van Staden

TL;DR

The paper decomposes the Jaccard-based metrics used in ABCDE. It splits the JaccardDistance, which measures the magnitude of a clustering difference, into Split and Merge components and further into Good and Bad contributions, and it decomposes the JaccardIndex into affected and unaffected parts with analogous quality splits. It gives exact population-level definitions and practical, unbiased estimation procedures based on pairwise judgements over weighted samples, including strategies for sampling and for confidence intervals. The resulting metrics (including DeltaPrecision as a bonus) are interrelated through simple equations and support deeper debugging and interactive exploration of clustering changes. The work positions itself as complementary to ABCDE, offering an alternative perspective and additional tooling for understanding the nature and quality of clustering changes, with guidance on stratified sampling and implementation considerations. In practice, it enables more nuanced, human-judged evaluation of large-scale clustering changes and gives a structured path to debugging diffs via item-pair and cluster-level analyses.

Abstract

ABCDE is a sophisticated technique for evaluating differences between very large clusterings. Its main metric that characterizes the magnitude of the difference between two clusterings is the JaccardDistance, which is a true distance metric in the space of all clusterings of a fixed set of (weighted) items. The JaccardIndex is the complementary metric that characterizes the similarity of two clusterings. Its relationship with the JaccardDistance is simple: JaccardDistance + JaccardIndex = 1. This paper decomposes the JaccardDistance and the JaccardIndex further. In each case, the decomposition yields Impact and Quality metrics. The Impact metrics measure aspects of the magnitude of the clustering diff, while Quality metrics use human judgements to measure how much the clustering diff improves the quality of the clustering. The decompositions of this paper offer more and deeper insight into a clustering change. They also unlock new techniques for debugging and exploring the nature of the clustering diff. The new metrics are mathematically well-behaved and they are interrelated via simple equations. While the work can be seen as an alternative formal framework for ABCDE, we prefer to view it as complementary. It certainly offers a different perspective on the magnitude and the quality of a clustering change, and users can use whatever they want from each approach to gain more insight into a change.
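
To make the identity JaccardDistance + JaccardIndex = 1 concrete, here is a minimal Python sketch of the per-item quantities. It assumes the usual weighted-Jaccard form, in which each region of $\mathit{Base}(i) \cup \mathit{Exp}(i)$ is measured by its weight divided by the weight of the union. The split and merge terms are assumed to be the normalized weights of the items that leave and join the cluster of $i$. The function names, dictionary keys, and example weights are hypothetical and do not come from the paper.

```python
from typing import Dict, Set


def weight(items: Set[str], w: Dict[str, float]) -> float:
    """Total weight of a set of items."""
    return sum(w[x] for x in items)


def item_metrics(base_cluster: Set[str], exp_cluster: Set[str],
                 w: Dict[str, float]) -> Dict[str, float]:
    """Per-item Jaccard quantities for an item i.

    base_cluster is the cluster of i in the baseline clustering and
    exp_cluster is the cluster of i in the experiment clustering.
    Both contain i, so the union is never empty.
    """
    total = weight(base_cluster | exp_cluster, w)
    index = weight(base_cluster & exp_cluster, w) / total
    # Assumed decomposition: items split off from i vs. items merged into i.
    split = weight(base_cluster - exp_cluster, w) / total
    merge = weight(exp_cluster - base_cluster, w) / total
    return {
        "jaccard_index": index,
        "split_distance": split,
        "merge_distance": merge,
        "jaccard_distance": split + merge,
    }


# Hypothetical example: item "a" loses "c" and gains "d".
weights = {"a": 1.0, "b": 2.0, "c": 1.0, "d": 4.0}
m = item_metrics({"a", "b", "c"}, {"a", "b", "d"}, weights)
print(m)
```

In this example the union has weight 8, so the index is 3/8, the split term is 1/8, and the merge term is 4/8. The three regions partition the union, which is why the distance and the index sum to 1.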

Paper Structure (20 sections, 55 equations, 1 figure)

This paper contains 20 sections, 55 equations, 1 figure.

Introduction
High-level overview
Preliminaries
Estimating a weighted sum from a weighted sample
Decomposing the $\mathit{JaccardDistance}$
Decomposing the $\mathit{JaccardDistance}$ of individual items
Decomposing the overall $\mathit{JaccardDistance}$
Calculation and estimation
Summary of the estimation so far
Sampling pairs of items for understanding the overall $\mathit{JaccardDistance}$
Decomposing the $\mathit{JaccardIndex}$
Decomposing the $\mathit{JaccardIndex}$ of individual items
Decomposing the overall $\mathit{JaccardIndex}$
$\Delta\mathit{Precision}$ and its estimation
Relationship with the estimation of $\Delta\mathit{Precision}$ in van Staden and Grubb (2024)
...and 5 more sections

Figures (1)

Figure 1: The clustering quality situation from the perspective of item $i$. The item $i$ is always in the intersection of $\mathit{Base}(i)$ and $\mathit{Exp}(i)$ and $\mathit{Ideal}(i)$, which is never empty. $\mathit{Ideal}(i)$ is the set of all items that are truly equivalent to $i$. Each area inside the Venn diagram is labeled with its weight divided by $\mathit{weight}(\mathit{Base}(i) \cup \mathit{Exp}(i))$. To save space we omit the suffix '$(i)$' from the labels. So, for example, the label $\mathit{GoodSplitDistance}$ in the diagram stands for $\mathit{GoodSplitDistance}(i)$.
