Continuous Sweep for Binary Quantification Learning

Kevin Kloos, Julian D. Karch, Quinten A. Meertens, Mark de Rooij

TL;DR

Three simulation studies show that Continuous Sweep outperforms the quantifiers in the group Classify, Count, and Correct, and is competitive with the two best quantifiers from the group Distribution Matchers.

Abstract

A quantifier is a supervised machine learning algorithm that estimates the class prevalence in a dataset rather than labeling its individual observations. We introduce Continuous Sweep, a new parametric binary quantifier inspired by the well-performing Median Sweep, an ensemble method based on Adjusted Count estimators. We modify two aspects of Median Sweep: 1) we use parametric class distributions instead of empirical distributions to estimate the true and false positive rates; 2) we use the mean instead of the median of a set of Adjusted Count estimates. These two modifications allow for a theoretical analysis of the bias and variance of Continuous Sweep. Furthermore, the expressions for bias and variance can be used to define optimal decision boundaries for the set of Adjusted Count estimates used in the ensemble. We show in three simulation studies that Continuous Sweep outperforms the quantifiers in the group Classify, Count, and Correct, including Median Sweep, and is competitive with the two best quantifiers from the group Distribution Matchers. An analysis of an empirical dataset with these quantifiers shows similar performance.
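To make the starting point concrete, the following is a minimal sketch of an Adjusted Count estimate at a single threshold and a Median-Sweep-style ensemble built from it. It is an illustration under stated assumptions, not the authors' implementation: the function names, the synthetic Normal scores, and the `min_gap` filter on the difference between the true and false positive rates are all hypothetical choices made here.

```python
import random
import statistics


def rate_above(scores, theta):
    """Fraction of scores strictly above the threshold theta."""
    return sum(s > theta for s in scores) / len(scores)


def adjusted_count(test_scores, pos_train, neg_train, theta):
    """Adjusted Count estimate at one threshold (None if rates do not separate)."""
    tpr = rate_above(pos_train, theta)  # empirical true positive rate
    fpr = rate_above(neg_train, theta)  # empirical false positive rate
    if tpr <= fpr:
        return None
    cac = rate_above(test_scores, theta)  # Classify and Count proportion
    return (cac - fpr) / (tpr - fpr)


def median_sweep(test_scores, pos_train, neg_train, min_gap=0.25):
    """Median of Adjusted Count estimates over all training-score thresholds.

    min_gap is an assumed filter: thresholds with tpr - fpr below it are skipped.
    """
    estimates = []
    for theta in sorted(pos_train + neg_train):
        gap = rate_above(pos_train, theta) - rate_above(neg_train, theta)
        if gap >= min_gap:
            estimates.append(adjusted_count(test_scores, pos_train, neg_train, theta))
    if not estimates:
        raise ValueError("no threshold separates the classes well enough")
    return statistics.median(estimates)


# Synthetic discriminant scores (assumed): positives ~ N(2, 1), negatives ~ N(0, 1).
rng = random.Random(0)
pos_train = [rng.gauss(2.0, 1.0) for _ in range(500)]
neg_train = [rng.gauss(0.0, 1.0) for _ in range(500)]

# Test set with a true prevalence of exactly 0.3 (300 positives, 700 negatives).
test_scores = [rng.gauss(2.0, 1.0) for _ in range(300)]
test_scores += [rng.gauss(0.0, 1.0) for _ in range(700)]

estimate = median_sweep(test_scores, pos_train, neg_train)
print(round(estimate, 3))
```

Continuous Sweep, as described above, would replace the empirical rates in `rate_above` with parametric ones and the median with a mean.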

Paper Structure

This paper contains 17 sections, 2 theorems, 33 equations, 9 figures, 1 table, 1 algorithm.

Key Result

Theorem 1

Assuming that $F^+(\theta)$ and $F^-(\theta)$ are known, the expected value of the Continuous Sweep quantifier $\hat{\alpha}_\text{CS}$ equals the true prevalence $\alpha$; the quantifier is therefore unbiased.
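
The reasoning behind this result can be outlined in a few lines. This is an illustrative sketch, not the paper's proof. It assumes that $F^+(\theta)$ and $F^-(\theta)$ are the true and false positive rates at threshold $\theta$, that $F^+(\theta) \neq F^-(\theta)$ on $[\theta_\text{l}, \theta_\text{r}]$, that the bounds $\theta_\text{l}$ and $\theta_\text{r}$ are fixed, that the test scores follow the same class-conditional distributions as the training scores, and that $\hat{\alpha}_\text{CS}$ is the average of the Adjusted Count estimates over that interval. The symbol $\widehat{\text{CAC}}(\theta)$ is introduced here for the proportion of test observations classified as positive at $\theta$.

$$
\begin{aligned}
\mathbb{E}\big[\widehat{\text{CAC}}(\theta)\big] &= \alpha F^+(\theta) + (1 - \alpha) F^-(\theta), \\
\mathbb{E}\left[\frac{\widehat{\text{CAC}}(\theta) - F^-(\theta)}{F^+(\theta) - F^-(\theta)}\right] &= \frac{\alpha \big(F^+(\theta) - F^-(\theta)\big)}{F^+(\theta) - F^-(\theta)} = \alpha, \\
\mathbb{E}\big[\hat{\alpha}_\text{CS}\big] &= \frac{1}{\theta_\text{r} - \theta_\text{l}} \int_{\theta_\text{l}}^{\theta_\text{r}} \alpha \, d\theta = \alpha.
\end{aligned}
$$

The last step exchanges expectation and integration, which is justified here because the integrand is bounded on the fixed interval.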

Figures (9)

  • Figure 1: The difference between Median Sweep and Continuous Sweep in estimating the true and false positive rates. In Panel (a), the histogram shows the distributions of the discriminant scores in the negative (purple) and the positive (yellow) class, where the corresponding coloured lines show the fitted Normal distributions. In Panel (b), Median Sweep uses empirical cumulative distribution functions (i.e., step functions) while Continuous Sweep uses the Gaussian cumulative distribution (i.e., smooth curves) to estimate the true/false positive rate.
  • Figure 2: Using a Classify and Count function as a step function (Panel (a)) and the true and false positive rate as continuous functions (Panel (b)), the Adjusted Count quantifier can estimate prevalence $\alpha$ at each $\theta$ (Panel (c)).
  • Figure 3: Example of discarding individual data points (red/green) against decision boundaries (grey, vertical). For Median Sweep, the green points are included in the prevalence calculation, while the red points are excluded. Continuous Sweep includes every $\theta$ between $\theta_\text{l}$ and $\theta_\text{r}$ in the calculations.
  • Figure 4: Example of using integrals to compute the prevalence with Continuous Sweep. The estimated prevalence is the sum of the coloured areas divided by the difference between $\theta_\text{l}$ and $\theta_\text{r}$.
  • Figure 5: Panel (a) illustrates the difference between $F^+(\theta)$ and $F^-(\theta)$ for each $\theta$. Panel (b) illustrates the theoretical variance at different $p^\Delta$ of the Continuous Sweep with a $D_\text{test}$ of size $1000$ and prevalence $\alpha = 0.5$, where $F^+ \sim N(1, 1)$ and $F^- \sim N(0, 1)$.
  • ...and 4 more figures
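
The integral-based computation described in the captions of Figures 1 and 4 can be sketched as follows. This is a rough illustration under stated assumptions, not the paper's algorithm: it fits one Normal distribution per class, treats `p_delta` as a minimum required difference between the true and false positive rates (an assumed reading of $p^\Delta$), and approximates the integral with a trapezoid rule on a hypothetical grid of `n_grid` thresholds.

```python
import bisect
import random
from statistics import NormalDist


def continuous_sweep(test_scores, pos_train, neg_train, p_delta=0.25, n_grid=2001):
    """Mean Adjusted Count estimate over thresholds in [theta_l, theta_r]."""
    # Parametric class distributions fitted to the training scores.
    pos = NormalDist.from_samples(pos_train)
    neg = NormalDist.from_samples(neg_train)

    def tpr(theta):  # smooth true positive rate
        return 1.0 - pos.cdf(theta)

    def fpr(theta):  # smooth false positive rate
        return 1.0 - neg.cdf(theta)

    sorted_test = sorted(test_scores)

    def cac(theta):  # step function: share of test scores above theta
        above = len(sorted_test) - bisect.bisect_right(sorted_test, theta)
        return above / len(sorted_test)

    # Candidate thresholds on an evenly spaced grid covering both classes.
    spread = 4.0 * max(pos.stdev, neg.stdev)
    lo = min(pos.mean, neg.mean) - spread
    hi = max(pos.mean, neg.mean) + spread
    step = (hi - lo) / (n_grid - 1)
    grid = [lo + i * step for i in range(n_grid)]

    # Keep thresholds where the rates differ by at least p_delta. For two
    # Normal distributions these grid points are contiguous, so the first and
    # last kept points play the roles of theta_l and theta_r.
    kept = [t for t in grid if tpr(t) - fpr(t) >= p_delta]
    if len(kept) < 2:
        raise ValueError("p_delta is too large for these class distributions")

    values = [(cac(t) - fpr(t)) / (tpr(t) - fpr(t)) for t in kept]

    # Trapezoid rule for the integral, then divide by the interval length.
    area = step * (sum(values) - 0.5 * (values[0] + values[-1]))
    return area / (kept[-1] - kept[0])


# Synthetic discriminant scores (assumed): positives ~ N(2, 1), negatives ~ N(0, 1).
rng = random.Random(1)
pos_train = [rng.gauss(2.0, 1.0) for _ in range(500)]
neg_train = [rng.gauss(0.0, 1.0) for _ in range(500)]

# Test set with a true prevalence of exactly 0.3.
test_scores = [rng.gauss(2.0, 1.0) for _ in range(300)]
test_scores += [rng.gauss(0.0, 1.0) for _ in range(700)]

cs_estimate = continuous_sweep(test_scores, pos_train, neg_train)
print(round(cs_estimate, 3))
```

Raising `p_delta` narrows the interval between the two bounds, which trades fewer thresholds in the average against more stable individual Adjusted Count estimates.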

Theorems & Definitions (4)

  • Theorem 1
  • proof
  • Theorem 2
  • proof