An algorithm for clustering with confidence-based must-link and cannot-link constraints

Philipp Baumann, Dorit S. Hochbaum

TL;DR

The Pairwise-Confidence-Constraints-Clustering (PCCC) algorithm iteratively assigns objects to clusters while accounting for the information provided on pairs of objects. On instances with all hard or all soft constraints, it outperforms state-of-the-art approaches in terms of both runtime and various metrics of solution quality.

Abstract

We study the semi-supervised $k$-clustering problem in which information is available on whether pairs of objects are in the same or in different clusters. This information is available either with certainty or with a limited level of confidence. We introduce the PCCC (Pairwise-Confidence-Constraints-Clustering) algorithm, which iteratively assigns objects to clusters while accounting for the information provided on the pairs of objects. Our algorithm uses integer programming for the assignment of objects, which allows relationships to be included as hard constraints that are guaranteed to be satisfied or as soft constraints that can be violated subject to a penalty. This flexibility distinguishes our algorithm from the state-of-the-art, in which all pairwise constraints are either considered hard or all are considered soft. We developed an enhanced multi-start approach and a model-size reduction technique for the integer program that contribute to the effectiveness and the efficiency of the algorithm. Unlike existing algorithms, our algorithm scales to large instances with up to 60,000 objects, 100 clusters, and millions of cannot-link constraints (which are the most challenging constraints to incorporate). We compare the PCCC algorithm with state-of-the-art approaches in an extensive computational study. Even though the PCCC algorithm is more general than the state-of-the-art approaches in its applicability, it outperforms them on instances with all hard or all soft constraints, both in terms of runtime and various metrics of solution quality. The code of the PCCC algorithm is publicly available on GitHub.
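
The alternation between an assignment step and a center update described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: exhaustive enumeration stands in for the integer program (so it only works for a handful of objects), and the function names, the confidence weight `w`, and the `penalty` parameter are assumptions introduced here for illustration.

```python
from itertools import product

def assign(points, centers, hard_ml, hard_cl, soft_ml, soft_cl, penalty):
    """Cheapest feasible labeling by exhaustive search (stand-in for the integer program)."""
    k = len(centers)
    best_cost, best_labels = float("inf"), None
    for labels in product(range(k), repeat=len(points)):
        # Hard constraints must hold; otherwise the labeling is infeasible.
        if any(labels[i] != labels[j] for i, j in hard_ml):
            continue
        if any(labels[i] == labels[j] for i, j in hard_cl):
            continue
        # Squared Euclidean distance of every object to its assigned center.
        cost = sum(
            sum((p - c) ** 2 for p, c in zip(points[i], centers[labels[i]]))
            for i in range(len(points))
        )
        # Soft constraints (i, j, w): a violation costs penalty * w (assumed form).
        cost += sum(penalty * w for i, j, w in soft_ml if labels[i] != labels[j])
        cost += sum(penalty * w for i, j, w in soft_cl if labels[i] == labels[j])
        if cost < best_cost:
            best_cost, best_labels = cost, labels
    return best_labels, best_cost

def update_centers(points, labels, centers):
    """Move each center to the mean of its objects; empty clusters keep their center."""
    new_centers = []
    for c, old in enumerate(centers):
        members = [points[i] for i, lab in enumerate(labels) if lab == c]
        if members:
            new_centers.append(tuple(sum(x) / len(members) for x in zip(*members)))
        else:
            new_centers.append(old)
    return new_centers

def cluster(points, centers, max_iter=20, **constraints):
    """Alternate assignment and center update until the labels stop changing."""
    labels = None
    for _ in range(max_iter):
        new_labels, _ = assign(points, centers, **constraints)
        if new_labels == labels:
            break
        labels = new_labels
        centers = update_centers(points, labels, centers)
    return labels, centers

points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0), (1.0, 1.0)]
start = [(0.0, 0.0), (5.0, 5.0)]
# Hard cannot-link between objects 0 and 4: object 4 is forced out of its nearest cluster.
hard_labels, _ = cluster(points, start, hard_ml=[], hard_cl=[(0, 4)],
                         soft_ml=[], soft_cl=[], penalty=10.0)
# The same pair as a soft constraint with confidence 0.5: violating it is cheaper here.
soft_labels, _ = cluster(points, start, hard_ml=[], hard_cl=[],
                         soft_ml=[], soft_cl=[(0, 4, 0.5)], penalty=10.0)
```

In this toy instance the hard constraint separates objects 0 and 4, while the soft version leaves them together because the penalty is smaller than the extra distance cost. Mixing both kinds of constraints in one call is what the abstract describes as the flexibility of the approach.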

Paper Structure

This paper contains 26 sections, 1 theorem, 2 equations, 14 figures, and 6 tables.

Key Result

Lemma 1

If a feasible assignment exists, we find a feasible assignment by solving model (R($q$)MBLP) with $q\geq\min(1 + \Delta(H), k)$.
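
The bound in Lemma 1 can be made concrete with a small sketch. It rests on assumptions not spelled out in this summary: $H$ is taken to be the graph whose edges are the hard cannot-link constraints, $\Delta(H)$ its maximum degree, and $q$ the number of candidate clusters each object may be assigned to. Must-link pairs are assumed to be already merged into single nodes. The helper names are hypothetical, and the greedy routine only illustrates why $1 + \Delta(H)$ candidates are enough; it is not the model (R($q$)MBLP) itself.

```python
from collections import defaultdict

def max_degree(n, hard_cl):
    """Maximum number of hard cannot-link constraints incident to any object."""
    deg = [0] * n
    for i, j in hard_cl:
        deg[i] += 1
        deg[j] += 1
    return max(deg, default=0)

def lemma_bound(n, k, hard_cl):
    """Smallest q covered by the lemma: min(1 + Delta(H), k)."""
    return min(1 + max_degree(n, hard_cl), k)

def greedy_assign(candidates, hard_cl):
    """Give each object its first candidate cluster not used by an assigned neighbor.

    candidates[i] lists the clusters allowed for object i, in preference order.
    Returns None if some object has no free candidate left.
    """
    neighbors = defaultdict(set)
    for i, j in hard_cl:
        neighbors[i].add(j)
        neighbors[j].add(i)
    labels = {}
    for i, cands in enumerate(candidates):
        taken = {labels[j] for j in neighbors[i] if j in labels}
        free = [c for c in cands if c not in taken]
        if not free:
            return None
        labels[i] = free[0]
    return [labels[i] for i in range(len(candidates))]

# Three objects that must all be in different clusters (a triangle in H).
triangle = [(0, 1), (1, 2), (0, 2)]
q = lemma_bound(n=3, k=5, hard_cl=triangle)
too_few = greedy_assign([[0, 1]] * 3, triangle)       # q = 2 candidates per object
enough = greedy_assign([[0, 1, 2]] * 3, triangle)     # q = 3 candidates per object
```

Each object has at most $\Delta(H)$ neighbors, and each neighbor blocks at most one cluster, so with $1 + \Delta(H)$ candidates at least one always remains free. When $k$ is the smaller term, every cluster is a candidate and the reduced model is no more restrictive than the full one.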

Figures (14)

  • Figure 1: Illustrative example: input data (left) and ground truth (right).
  • Figure 2: Flowchart of the PCCC algorithm. Illustrations 1-9 are given in Figure 3.
  • Figure 3: Illustrations for flowchart based on the illustrative example. The algorithm is applied with parameter $q=2$, which explains why the hard cannot-link constraint between the contracted nodes 4 and 9 can be omitted in the assignment step (see illustrations 5 and 6).
  • Figure 4: Illustration of cluster repositioning with the synthetic data set n1000-k20 and the cannot-link constraints of constraint set 15% CS (provided as soft constraints). The left plot shows the converged solution with $q=2$ before repositioning, and the right plot shows the converged solution with $q=2$ after repositioning. Repositioning improves both the number of violated cannot-link constraints and the total inertia.
  • Figure 5: Illustration of the dynamic enlargement of the search space with the synthetic data set n1000-k20 and the cannot-link constraints of constraint set 15% CS (provided as soft constraints). The left plot highlights the $\gamma=50$ critical objects (red and orange) in the converged solution obtained with $q=2$. The right plot highlights the additional $\delta=3$ assignment variables that are introduced for the critical objects. With the enlarged search space (150 additional binary variables), all violations can be resolved in the next assignment step. Increasing $q$ uniformly for all objects from $q=2$ to $q=5$ would have led to an increase of 3,000 binary variables.
  • ...and 9 more figures
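
The variable counts quoted in the Figure 5 caption can be checked with a short calculation, assuming that the data set name n1000-k20 indicates $n = 1000$ objects and that each unit increase of $q$ adds one binary assignment variable per object:

$$\gamma \cdot \delta = 50 \cdot 3 = 150, \qquad n \cdot (5 - 2) = 1000 \cdot 3 = 3000.$$

Under these assumptions, enlarging the search space only for the critical objects adds one twentieth of the variables that a uniform increase of $q$ would add.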

Theorems & Definitions (1)

  • Lemma 1