Crowdsourcing Without People: Modelling Clustering Algorithms as Experts
Jordyn E. A. Lorentz, Katharine M. Clark
TL;DR
The paper addresses the challenge of selecting a single clustering algorithm when the true data structure is unknown by proposing mixsemble, an EM-based ensemble that treats results from multiple clustering methods as noisy annotations under a Dawid-Skene framework. The method operates in two stages: (1) run diverse clustering algorithms under different distributions, and (2) aggregate their outputs using latent true labels and per-algorithm error rates, estimated via EM with a finite mixture prior $f(\mathbf{x}\mid \boldsymbol{\vartheta})=\sum_{g=1}^G \pi_g f_g(\mathbf{x}\mid \boldsymbol{\theta}_g)$. Core equations include the likelihood $\prod_{i=1}^N\left(\sum_{g=1}^G \pi_g \prod_{k=1}^K \prod_{h=1}^G (\varepsilon_{gh}^{(k)})^{x_{ih}^{(k)}}\right)$ and EM updates for $\hat{z}_{ig}$, $\hat{\varepsilon}_{gh}^{(k)}$, and $\hat{\pi}_g$. Empirically, mixsemble matches or closely trails the best-performing algorithm while offering robustness against poor individual results, across simulated and real datasets. This makes it a practical, non-expert-friendly tool for obtaining reliable clustering when the underlying structure is unknown.
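To make the aggregation stage more concrete, the following is a minimal sketch of textbook Dawid-Skene EM in NumPy, written against the likelihood quoted above. It is not the authors' implementation: the function name `dawid_skene_em`, the soft majority-vote initialisation, the smoothing constant, and the stopping rule are all assumptions made for illustration. The sketch also assumes each algorithm's labels are coded as integers in 0..G-1 and are roughly aligned across algorithms, so that the vote-based initialisation is meaningful.

```python
import numpy as np

def dawid_skene_em(labels, n_clusters, n_iter=100, tol=1e-8):
    """Illustrative Dawid-Skene EM (hypothetical helper, not from the paper).

    labels: (N, K) integer array; labels[i, k] is the cluster that
            algorithm k assigns to observation i, coded 0..G-1.
    Returns (z, eps, pi): posterior memberships (N, G), per-algorithm
    error matrices (K, G, G), and mixing proportions (G,).
    """
    labels = np.asarray(labels)
    G = n_clusters
    # One-hot indicators: x[i, k, h] = 1 if algorithm k put observation i in cluster h.
    x = np.eye(G)[labels]                       # shape (N, K, G)
    # Assumed initialisation: soft majority vote across algorithms.
    z = x.mean(axis=1)                          # shape (N, G)
    prev_ll = -np.inf
    for _ in range(n_iter):
        # M-step: pi_g = (1/N) * sum_i z_ig
        pi = z.mean(axis=0)
        # M-step: eps[k, g, h] = sum_i z_ig x_ih^(k) / sum_i z_ig
        eps = np.einsum("ig,ikh->kgh", z, x) + 1e-10   # small smoothing avoids log(0)
        eps /= eps.sum(axis=2, keepdims=True)
        # E-step: log pi_g + sum_k sum_h x_ih^(k) * log eps_gh^(k)
        log_joint = np.log(pi + 1e-300) + np.einsum("ikh,kgh->ig", x, np.log(eps))
        log_norm = np.logaddexp.reduce(log_joint, axis=1)
        z = np.exp(log_joint - log_norm[:, None])
        # Observed-data log-likelihood, used only as a stopping criterion.
        ll = log_norm.sum()
        if abs(ll - prev_ll) < tol:
            break
        prev_ll = ll
    return z, eps, pi
```

Under these assumptions, a consensus partition would be read off as `z.argmax(axis=1)`, and each slice `eps[k]` acts as an estimated confusion matrix for algorithm k, which is what lets the ensemble down-weight an algorithm that fits the data poorly.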
Abstract
This paper introduces mixsemble, an ensemble method that adapts the Dawid-Skene model to aggregate predictions from multiple model-based clustering algorithms. Unlike traditional crowdsourcing, which relies on human labels, the framework models the outputs of clustering algorithms as noisy annotations. Experiments on both simulated and real-world datasets show that, although mixsemble is not always the single top performer, it consistently approaches the best result and avoids poor outcomes. This robustness makes it a practical alternative when the true data structure is unknown, especially for non-expert users.
