ABCDE: Application-Based Cluster Diff Evals

Stephan van Staden, Alexander Grubb

TL;DR

ABCDE provides a ground-truth-agnostic framework for evaluating two whole clusterings of massive populations by separating impact metrics from quality metrics. It uses the pointwise clustering metrics to define per-item splits, merges, and similarity, then lifts them to population-level measures with weighted averages and sampling. The method enables interactive slice-based exploration while offering statistically sound estimates of delta precision and Good/Bad split-merge rates through carefully designed human judgements and importance sampling. Its scalable, on-demand evaluation targets the actual clustering changes and supports debugging, refinement, and robust decision-making under resource constraints. Its independence from an up-front ground truth, combined with scalable sampling, makes ABCDE practically valuable for developers and practitioners operating at scale.

Abstract

This paper considers the problem of evaluating clusterings of very large populations of items. Given two clusterings, namely a Baseline clustering and an Experiment clustering, the tasks are twofold: 1) characterize their differences, and 2) determine which clustering is better. ABCDE is a novel evaluation technique for accomplishing that. It aims to be practical: it allows items to have associated importance values that are application-specific, it is frugal in its use of human judgements when determining which clustering is better, and it can report metrics for arbitrary slices of items, thereby facilitating understanding and debugging. The approach to measuring the delta in the clustering quality is novel: instead of trying to construct an expensive ground truth up front and evaluating each clustering with respect to that, where the ground truth must effectively pre-anticipate clustering changes, ABCDE samples questions for judgement on the basis of the actual diffs between the clusterings. ABCDE builds upon the pointwise metrics for clustering evaluation, which make the ABCDE metrics intuitive and simple to understand. The mathematical elegance of the pointwise metrics equips ABCDE with rigorous yet practical ways to explore the clustering diffs and to estimate the quality delta.

Paper Structure

This paper contains 23 sections, 57 equations, and 1 figure.

Figures (1)

  • Figure 1: The clustering situation from the perspective of item $i$. The item $i$ is always in the intersection of $\mathit{Base}(i)$ and $\mathit{Exp}(i)$, which is never empty. $\text{Split} = \mathit{Base}(i) \setminus \mathit{Exp}(i)$ denotes the set of items that got split off in $\mathit{Exp}$ from the perspective of $i$. $\text{Merge} = \mathit{Exp}(i) \setminus \mathit{Base}(i)$ denotes the set of items that are merged in $\mathit{Exp}$ from the perspective of $i$. $\text{Stable} = \mathit{Base}(i) \cap \mathit{Exp}(i)$ denotes the items that remained stable from the perspective of $i$.
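
The three sets in the Figure 1 caption can be computed directly with set operations. The following is a minimal sketch, not the authors' implementation: it assumes each clustering is stored as a mapping from item to cluster id, and the names `cluster_of` and `split_merge_stable`, as well as the toy data, are hypothetical and chosen only for illustration.

```python
from typing import Dict, Hashable, Set, Tuple

Item = Hashable
# Assumed representation (not from the paper): item -> cluster id.
Clustering = Dict[Item, int]


def cluster_of(clustering: Clustering, item: Item) -> Set[Item]:
    """Return the set of items sharing a cluster with `item`, e.g. Base(i) or Exp(i)."""
    cluster_id = clustering[item]
    return {j for j, c in clustering.items() if c == cluster_id}


def split_merge_stable(
    base: Clustering, exp: Clustering, item: Item
) -> Tuple[Set[Item], Set[Item], Set[Item]]:
    """Compute (Split, Merge, Stable) from the perspective of `item`."""
    base_i = cluster_of(base, item)
    exp_i = cluster_of(exp, item)
    split = base_i - exp_i   # Base(i) \ Exp(i): items split off in Exp
    merge = exp_i - base_i   # Exp(i) \ Base(i): items merged in Exp
    stable = base_i & exp_i  # Base(i) ∩ Exp(i): always contains `item` itself
    return split, merge, stable


# Toy data: in Base, item "a" is clustered with "b" and "c";
# in Exp, "c" leaves that cluster and "d" joins it.
base = {"a": 0, "b": 0, "c": 0, "d": 1}
exp = {"a": 0, "b": 0, "c": 1, "d": 0}

split, merge, stable = split_merge_stable(base, exp, "a")
print(split, merge, stable)
```

In this toy example, `split` holds only `"c"`, `merge` holds only `"d"`, and `stable` holds `"a"` and `"b"`. The linear scan in `cluster_of` is kept for readability; for very large populations one would presumably precompute an index from cluster id to members instead.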