Robust Mixture Learning when Outliers Overwhelm Small Groups

Daniil Dmitriev, Rares-Darius Buhai, Stefan Tiegel, Alexander Wolters, Gleb Novikov, Amartya Sanyal, David Steurer, Fanny Yang

TL;DR

This work addresses the problem of learning the means of a $k$-component well-separated mixture under adversarial contamination that can overwhelm small inlier groups, a setting called list-decodable mixture learning (LD-ML). It introduces a two-stage meta-algorithm that composes base learners for list-decodable mean estimation (LD-ME, corresponding to unknown inlier fractions) and for robust mean estimation to produce a short list of candidate means with per-component accuracy close to that of an oracle. The authors prove order-optimal error guarantees with a list-size overhead of $O\left(\varepsilon / w_{\mathrm{low}}\right)$, where $\varepsilon$ is the outlier fraction and $w_{\mathrm{low}}$ is a lower bound on the mixing weights. The results are particularly strong for Gaussian mixtures, where separation is leveraged to achieve tight bounds, and information-theoretic lower bounds show near-optimality. Compared with prior LD-ME approaches, the method improves both error and list size, especially under large adversarial contamination, which makes it relevant to clustering and mixture learning under severe noise.

Abstract

We study the problem of estimating the means of well-separated mixtures when an adversary may add arbitrary outliers. While strong guarantees are available when the outlier fraction is significantly smaller than the minimum mixing weight, much less is known when outliers may crowd out low-weight clusters - a setting we refer to as list-decodable mixture learning (LD-ML). In this case, adversarial outliers can simulate additional spurious mixture components. Hence, if all means of the mixture must be recovered up to a small error in the output list, the list size needs to be larger than the number of (true) components. We propose an algorithm that obtains order-optimal error guarantees for each mixture mean with a minimal list-size overhead, significantly improving upon list-decodable mean estimation, the only existing method that is applicable for LD-ML. Although improvements are observed even when the mixture is non-separated, our algorithm achieves particularly strong guarantees when the mixture is separated: it can leverage the mixture structure to partially cluster the samples before carefully iterating a base learner for list-decodable mean estimation at different scales.
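
To make the setting concrete, the following Python sketch generates a toy contaminated mixture in which the outlier fraction exceeds the smallest mixing weight, so the adversary can afford several fake clusters that look like genuine low-weight components. This is only an illustration under stated assumptions, not the paper's algorithm or experimental setup: the helper `sample_contaminated_mixture`, the unit-variance Gaussian components, and the random placement of fake cluster centers are all hypothetical choices made here.

```python
import numpy as np

def sample_contaminated_mixture(n, weights, means, eps, w_low, rng):
    """Hypothetical helper: Gaussian inlier clusters plus adversarial fake clusters.

    Assumes sum(weights) + eps == 1. Inliers get labels 0..k-1, outliers get -1.
    """
    means = np.asarray(means, dtype=float)
    d = means.shape[1]
    counts = [int(round(n * w)) for w in weights]   # inliers per true component
    n_out = int(round(n * eps))                     # adversary's sample budget
    per_fake = max(1, int(round(n * w_low)))        # size of one fake cluster
    n_fake = max(1, n_out // per_fake)              # roughly eps / w_low clusters

    points, labels = [], []
    for idx, (mu, c) in enumerate(zip(means, counts)):
        points.append(rng.normal(loc=mu, scale=1.0, size=(c, d)))
        labels.append(np.full(c, idx))

    # Assumption: fake centers are drawn at random on a large scale; a real
    # adversary could place them anywhere.
    fake_means = rng.normal(scale=50.0, size=(n_fake, d))
    for chunk, mu in zip(np.array_split(np.arange(n_out), n_fake), fake_means):
        points.append(rng.normal(loc=mu, scale=1.0, size=(len(chunk), d)))
        labels.append(np.full(len(chunk), -1))

    return np.vstack(points), np.concatenate(labels), n_fake

rng = np.random.default_rng(0)
X, labels, n_fake = sample_contaminated_mixture(
    n=2000,
    weights=[0.75, 0.05],            # second component has weight w_low
    means=[[0.0, 0.0], [40.0, 0.0]],
    eps=0.2,                         # outlier fraction is 4x the smallest weight
    w_low=0.05,
    rng=rng,
)
print(X.shape, n_fake)
```

With these toy parameters the outliers form as many fake clusters as the ratio $\varepsilon / w_{\mathrm{low}}$ allows, each the same size as the smallest true component. Without labels, no estimator can tell which of these small clusters is genuine, which is why a list that covers every true mean has to be longer than the number of true components.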

Paper Structure

This paper contains 64 sections, 15 theorems, 47 equations, 10 figures, 1 table, 7 algorithms.

Key Result

Theorem 3.3

Let $d, k \in \mathbb{N}_+$, $w_{\mathrm{low}} \in (0, 1/2]$, and $t$ be an even integer. Let $\mathcal{X}$ be a $d$-dimensional mixture distribution following \ref{eq:gen_model}. Let $\mathcal{A}_{\mathrm{kLD}}$ and $\mathcal{A}_R$ satisfy \ref{asm:algs} and \ref{asm:well-behaved} for some even $t$. Further, suppose that [...] If the relative weight of the $i$-th cluster is large, i.e., $\tilde{\varepsilon}_i \leqslant 0.001$ [...]

Figures (10)

  • Figure 1: Schematic of the meta-algorithm (\ref{alg:full_alg}) underlying \ref{thm:informal}.
  • Figure 2: Comparison of five algorithms with two adversarial noise models. The attack distributions and further experimental details are given in \ref{app:exp-details}. On the left we show the worst estimation error for a constrained list size, and on the right the smallest list size for a constrained error guarantee. We plot the median of the metrics, with error bars showing the $25$th and $75$th percentiles.
  • Figure 3: Comparison of list size and estimation error for a large inlier cluster for varying $w_{\mathrm{low}}$ inputs. The experimental setup is illustrated in \ref{app:exp-details}. We plot the median values, with error bars showing the $25$th and $75$th percentiles. As $w_{\mathrm{low}}$ decreases, we observe a roughly constant estimation error for our algorithm while the error for LD-ME increases. Further, the decrease in list size is much more severe for LD-ME than for our algorithm.
  • Figure 4: Two variants of the adversarial distribution: adversarial line (left) and adversarial clusters (right).
  • Figure 5: Comparison of five algorithms with three adversarial noise models. On the left we show the worst estimation error of the algorithms with a constrained list size, and on the right the smallest list size with a constrained error guarantee. We plot the median of the metrics, with error bars showing the $25$th and $75$th percentiles. We observe that our method consistently outperforms prior works in terms of list size and worst estimation error, with the exception of DBSCAN, which performs at a similar level.
  • ...and 5 more figures

Theorems & Definitions (27)

  • Definition 2.1
  • Theorem 3.3
  • Corollary 3.4: Gaussian case
  • proof
  • Proposition 3.5: Information-theoretic lower bounds
  • Definition B.1: Corruption model
  • Theorem B.2: Inner stage guarantees
  • Remark B.3
  • Remark B.4
  • Corollary B.5
  • ...and 17 more