Memetic Differential Evolution Methods for Semi-Supervised Clustering

Pierluigi Mansueto, Fabio Schoen

TL;DR

The paper tackles semi-supervised Minimum Sum-of-Squares Clustering with must-link and cannot-link constraints, an NP-hard problem. It introduces S-MDEClust, a memetic differential evolution framework that guarantees feasibility through exact and greedy assignment, a constraint-aware mutation, and a semi-supervised local-search step. Extensive experiments compare S-MDEClust variants against COP-K-Means, Baumann's BLP-KM, and the state-of-the-art global methods PC-SOS-SDP and S-HG-MEANS, showing favorable performance in both feasibility and efficiency. The work establishes the first feasible memetic approach for semi-supervised MSSC and provides a foundation for future constraint-handling enhancements in clustering.

Abstract

In this paper, we propose an extension of MDEClust, a memetic framework based on the Differential Evolution paradigm for unsupervised clustering, to semi-supervised Minimum Sum-of-Squares Clustering (MSSC) problems. In semi-supervised MSSC, background knowledge is available in the form of (instance-level) "must-link" and "cannot-link" constraints, each indicating whether two dataset points should be assigned to the same cluster or to different clusters, respectively. The presence of such constraints makes the problem at least as hard as its unsupervised version and, as a consequence, some framework operations need to be carefully designed to handle this additional complexity: for instance, it is no longer true that each point is assigned to its nearest cluster center. As far as we know, our new framework, called S-MDEClust, represents the first memetic methodology designed to generate a (hopefully) optimal feasible solution for semi-supervised MSSC problems. Results of thorough computational experiments on a set of well-known as well as synthetic datasets show the effectiveness and efficiency of our proposal.
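
To make the remark about nearest-center assignment concrete, the following is a minimal sketch on a toy dataset, not the S-MDEClust procedure itself. The data, the constraint pairs, and the helper names (`mssc_objective`, `is_feasible`, `nearest_center_assignment`) are illustrative assumptions rather than anything taken from the paper. It shows that assigning each point to its nearest center can violate a must-link constraint, and that a feasible labeling may have to place a point in a cluster whose center is not the closest one, at a higher objective value.

```python
# Illustrative sketch only: toy data and helper names are assumptions,
# not part of the paper's implementation.

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def centroid(points):
    """Coordinate-wise mean of a non-empty list of points."""
    dim = len(points[0])
    return [sum(p[d] for p in points) / len(points) for d in range(dim)]

def mssc_objective(X, labels, K):
    """Sum of squared distances from each point to its cluster centroid."""
    total = 0.0
    for k in range(K):
        members = [x for x, l in zip(X, labels) if l == k]
        if members:
            c = centroid(members)
            total += sum(sq_dist(x, c) for x in members)
    return total

def is_feasible(labels, must_link, cannot_link):
    """Check instance-level must-link and cannot-link constraints."""
    ml_ok = all(labels[i] == labels[j] for i, j in must_link)
    cl_ok = all(labels[i] != labels[j] for i, j in cannot_link)
    return ml_ok and cl_ok

def nearest_center_assignment(X, centers):
    """Unconstrained assignment: each point goes to its closest center."""
    return [min(range(len(centers)), key=lambda k: sq_dist(x, centers[k]))
            for x in X]

# Four points on a line and two centers (hypothetical values).
X = [[0.0, 0.0], [1.0, 0.0], [4.0, 0.0], [5.0, 0.0]]
centers = [[0.5, 0.0], [4.5, 0.0]]
must_link = [(1, 2)]     # points 1 and 2 must share a cluster
cannot_link = [(0, 1)]   # points 0 and 1 must be separated

unconstrained = nearest_center_assignment(X, centers)
feasible = [0, 1, 1, 1]  # one labeling that satisfies both constraints

print(unconstrained, is_feasible(unconstrained, must_link, cannot_link),
      mssc_objective(X, unconstrained, 2))
print(feasible, is_feasible(feasible, must_link, cannot_link),
      mssc_objective(X, feasible, 2))
```

In the feasible labeling, point 1 sits in the cluster of points 2 and 3 even though the other cluster's centroid is closer to it, which is exactly why the assignment step in a constrained setting needs dedicated handling.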

Paper Structure (15 sections, 4 equations, 3 figures, 4 tables, 3 algorithms)

Figures (3)

  • Figure 1: Two-dimensional examples of assignment step outcome with $N = 20$ points and $K = 3$ cluster centers. The dotted black lines indicate the separations between clusters, while shaded areas just emphasize cluster centers.
  • Figure 2: Performance profiles for SG-MDE, S-MDE, SMG-MDE and SM-MDE on the datasets Iris, Accent and ECG5000 (see Table \ref{tab::datasets}). Note that the intervals of the axes were set for a better visualization of the numerical results.
  • Figure 3: Performance profiles for SG-MDE, BLP-KM and COP-KM on the large-size datasets ($2^{\text{nd}}$ set of Table \ref{tab::datasets}). Note that the intervals of the x-axes were set for a better visualization of the numerical results.