Effects of Training Data Quality on Classifier Performance

Alan F. Karr; Regina Ruane

TL;DR

A picture of spatial heterogeneity emerges: as the training data move farther from analysis data, classifier decisions degenerate, the boundary becomes less dense, and congruence increases.

Abstract

We describe extensive numerical experiments assessing and quantifying how classifier performance depends on the quality of the training data, a frequently neglected component of the analysis of classifiers. More specifically, in the scientific context of metagenomic assembly of short DNA reads into "contigs," we examine the effects of degrading the quality of the training data by multiple mechanisms, and for four classifiers -- Bayes classifiers, neural nets, partition models and random forests. We investigate both individual behavior and congruence among the classifiers. We find breakdown-like behavior that holds for all four classifiers, as degradation increases and they move from being mostly correct to only coincidentally correct, because they are wrong in the same way. In the process, a picture of spatial heterogeneity emerges: as the training data move farther from analysis data, classifier decisions degenerate, the boundary becomes less dense, and congruence increases.
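
The degradation-and-congruence idea in the abstract can be illustrated with a minimal sketch. It assumes, hypothetically, that SNP degradation replaces each base independently with a different base at a fixed probability, and that congruence is the fraction of items on which two classifiers agree. The names `snp_degrade`, `congruence`, and `accuracy` are illustrative and not taken from the paper, whose exact mechanisms and measures may differ.

```python
import random

BASES = "ACGT"

def snp_degrade(seq, snp_probability, rng):
    """Assumed mechanism: each base is independently replaced by a
    different base with probability snp_probability."""
    out = []
    for base in seq:
        if rng.random() < snp_probability:
            out.append(rng.choice([b for b in BASES if b != base]))
        else:
            out.append(base)
    return "".join(out)

def congruence(pred_a, pred_b):
    """Assumed measure: fraction of items on which two classifiers agree."""
    if len(pred_a) != len(pred_b):
        raise ValueError("prediction lists must have equal length")
    return sum(a == b for a, b in zip(pred_a, pred_b)) / len(pred_a)

def accuracy(pred, truth):
    """Fraction of predictions that match the true labels."""
    return congruence(pred, truth)

# Two hypothetical classifiers that agree with each other on every item
# while both being wrong on most items.
truth = ["virus", "virus", "bacteria", "virus"]
pred_a = ["bacteria", "bacteria", "bacteria", "bacteria"]
pred_b = ["bacteria", "bacteria", "bacteria", "bacteria"]

rng = random.Random(0)
degraded = snp_degrade("ACGTACGTAC", 0.2, rng)
```

In this toy case congruence between `pred_a` and `pred_b` is high while accuracy is low, which mirrors the abstract's point that classifiers can be only coincidentally correct because they are wrong in the same way.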

Paper Structure (34 sections, 12 equations, 35 figures, 6 tables)

Introduction
Background and Problem Formulation
Data Quality
Scientific Context
Classifiers
Classifier Boundaries
Surrogate Uncertainty Measures
Experimental Protocol
Mathematical Preliminaries
Training and Validation Datasets
The Four Classifiers
Bayes Classifier
Neural Net Classifier
Partition Model Classifier
Random Forest Classifier
...and 19 more sections

Figures (35)

Figure 1: Effect of SNP degradation [dqdegradation-2021] on the entropy of 26964 virus genomes.
Figure 2: Confusion matrices for the validation dataset $\mathcal{V}$ for the four classifiers trained on the undegraded training data $\mathcal{T}$. Top left: Bayes classifier. Top right: neural net. Bottom left: partition model. Bottom right: random forest.
Figure 3: SNP degradation: number of correctly classified elements of $\mathcal{V}$ as a function of SNP_Probability.
Figure 4: SNP degradation: classifier predictions as a function of SNP_Probability. Upper left: Bayes classifier. Upper right: neural net. Lower left: partition model. Lower right: random forest.
Figure 5: SNP degradation: Boundary Status distribution as a function of SNP_Probability. Upper left: Bayes classifier. Upper right: neural net. Lower left: partition model. Lower right: random forest. In each of these, $\hbox{BS} = 0$ is the green line, $\hbox{BS} = 1$ is the yellow line, and $\hbox{BS} = 2$ is the red line.
...and 30 more figures
