Balancing Quality and Variation: Spam Filtering Distorts Data Label Distributions

Eve Fleisig, Matthias Orlikowski, Philipp Cimiano, Dan Klein

TL;DR

The paper addresses how to preserve population-relevant variation in subjective labeling while filtering out spam on crowdsourcing platforms. It empirically evaluates three common annotator-filtering methods (MACE, CrowdTruth, Cohen's Kappa) on two real datasets with known spammers, examining effects on label diversity and accuracy. The key finding is that removing annotators to reduce spam often harms the representation of variation, with optimal gains when only a small fraction (<5%) is removed; moreover, many spammers resemble non-spammers, and fixed-spammer behavior is particularly challenging to detect. The work highlights the need for spam-removal approaches that explicitly account for label diversity, suggesting richer signals beyond labeling behavior and urging shared benchmarks and datasets to improve future methods.

Abstract

For machine learning datasets to accurately represent diverse opinions in a population, they must preserve variation in data labels while filtering out spam or low-quality responses. How can we balance annotator reliability and representation? We empirically evaluate how a range of heuristics for annotator filtering affect the preservation of variation on subjective tasks. We find that these methods, designed for contexts in which variation from a single ground-truth label is considered noise, often remove annotators who disagree instead of spam annotators, introducing suboptimal tradeoffs between accuracy and label diversity. We find that conservative settings for annotator removal (<5%) are best, after which all tested methods increase the mean absolute error from the true average label. We analyze performance on synthetic spam to observe that these methods often assume spam annotators are more random than real spammers tend to be: most spammers are distributionally indistinguishable from real annotators, and the minority that are distinguishable tend to give relatively fixed answers, not random ones. Thus, tasks requiring the preservation of variation reverse the intuition of existing spam filtering methods: spammers tend to be less random than non-spammers, so metrics that assume variation is spam fare worse. These results highlight the need for spam removal methods that account for label diversity.
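
The claim that distinguishable spammers are less random than non-spammers can be illustrated with a minimal sketch. It assumes that "randomness" is measured as the Shannon entropy of an annotator's own labels; the function name `label_entropy` and the two toy annotators are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def label_entropy(labels):
    """Shannon entropy (in bits) of the empirical distribution of `labels`."""
    total = len(labels)
    counts = Counter(labels)
    # Each term is -p * log2(p); an annotator who always gives the same label scores 0.
    return sum(-(c / total) * math.log2(c / total) for c in counts.values())


# Hypothetical toy annotators on a 1-4 rating scale (illustrative values only).
fixed_annotator = [3, 3, 3, 3, 3, 3, 3, 3]    # always answers 3
varied_annotator = [1, 2, 3, 4, 1, 2, 3, 4]   # uses the whole scale

fixed_entropy = label_entropy(fixed_annotator)
varied_entropy = label_entropy(varied_annotator)
```

Under this measure the fixed-answer annotator has lower entropy than the varied one, so a filter that treats high variation as a sign of spam would rank the varied annotator as the more suspicious of the two.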

Paper Structure

This paper contains 28 sections, 2 equations, 12 figures.

Figures (12)

  • Figure 1: Across methods, increasing the number of removed annotators gradually decreases the accuracy of spam classification when over 2-4% of annotators are removed. Cohen's kappa and MACE increase the spam classification accuracy up to 4% of annotators removed on DICES; otherwise, the spam classification accuracy rarely rises above the baseline of not removing any annotators. The blue line indicates the true number of spammers in the data; the gray line indicates the baseline classification accuracy before removing any spammers.
  • Figure 2: Entropy of each instance's label distribution, averaged over all instances. Most methods decrease the entropy of the dataset as more raters are removed. CrowdTruth especially decreases the entropy.
  • Figure 3: On the MTurk dataset, all methods except random removal decrease the standard deviation of the dataset. Among the tested methods, CrowdTruth decreases the standard deviation most.
  • Figure 4: Entropy of each annotator's labeling distribution over all instances vs. score under filtering metrics (CrowdTruth and MACE). While many spam annotators are indistinguishable from non-spam ones under these metrics, those that are distinguishable often have very low entropy: they are less random than non-spam annotators, not more.
  • Figure 5: Mean absolute error of filtered ratings: for each example, the absolute difference between the average label of the non-spam annotators and the average label of the filtered annotators, averaged across examples.
  • ...and 7 more figures
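
The two dataset-level quantities named in the captions for Figures 2 and 5 can be sketched as follows. This is a rough illustration, not the paper's code: it assumes ratings are stored as a hypothetical `{annotator: {item: label}}` dictionary, and it reads "filtered annotators" as the annotators retained after filtering. All names and values are made up for the example.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of the empirical distribution of `labels`."""
    total = len(labels)
    return sum(-(c / total) * math.log2(c / total) for c in Counter(labels).values())


def labels_by_item(ratings, annotators):
    """Collect each item's labels, restricted to the given annotators."""
    grouped = {}
    for annotator in annotators:
        for item, label in ratings[annotator].items():
            grouped.setdefault(item, []).append(label)
    return grouped


def mean_instance_entropy(ratings, annotators):
    """Entropy of each item's label distribution, averaged over items."""
    grouped = labels_by_item(ratings, annotators)
    return sum(entropy(labels) for labels in grouped.values()) / len(grouped)


def mae_of_filtered(ratings, non_spam, kept):
    """Mean absolute gap between per-item average labels of two annotator sets."""
    reference = labels_by_item(ratings, non_spam)
    filtered = labels_by_item(ratings, kept)
    # Only items rated by both annotator sets can be compared.
    shared = [item for item in reference if item in filtered]
    errors = [
        abs(sum(reference[i]) / len(reference[i]) - sum(filtered[i]) / len(filtered[i]))
        for i in shared
    ]
    return sum(errors) / len(errors)


# Hypothetical toy ratings: three genuine annotators, two items.
ratings = {
    "a1": {"x": 1, "y": 4},
    "a2": {"x": 2, "y": 4},
    "a3": {"x": 1, "y": 2},
}
non_spam = ["a1", "a2", "a3"]
kept = ["a1", "a2"]  # a filter that removed a3, who disagrees on item "y"

entropy_before = mean_instance_entropy(ratings, non_spam)
entropy_after = mean_instance_entropy(ratings, kept)
error_after = mae_of_filtered(ratings, non_spam, kept)
```

In this toy case, removing the genuine but disagreeing annotator lowers the mean instance entropy and moves the per-item averages away from the non-spam averages. That is the kind of distortion the abstract describes.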