Standardness Clouds Meaning: A Position Regarding the Informed Usage of Standard Datasets

Tim Cech; Ole Wegen; Daniel Atzberger; Rico Richter; Willy Scheibel; Jürgen Döllner

TL;DR

The paper argues that the assumed standardness of widely used datasets can mask misalignment between use-case labels and underlying concepts, eroding trust in ML models. It proposes a quali-quantitative methodology that combines Grounded Theory with Hypothesis Testing through Visualization (VIS4GT) to interrogate dataset-label-use-case fit, illustrated on the 20 Newsgroups and MNIST datasets. The 20 Newsgroups case reveals imprecise labels and little discussion of the dataset's suitability, while MNIST shows relatively coherent labeling, supporting the method's ability to discriminate between problematic and solid standard datasets. The work highlights the need to assess dataset quality and suitability beyond conventional standardness and advocates iterative, human-in-the-loop dataset refinement to enhance explainability and trust in ML systems.

Abstract

Standard datasets are frequently used to train and evaluate Machine Learning models. However, the assumed standardness of these datasets leads to a lack of in-depth discussion of how well their labels match the categories derived for the respective use case, which we demonstrate by reviewing recent literature that employs standard datasets. We find that the standardness of the datasets seems to cloud their actual coherence and applicability, thus impeding trust in Machine Learning models trained on them. Therefore, we argue against the uncritical use of standard datasets and advocate for their critical examination instead. To this end, we suggest using Grounded Theory in combination with Hypothesis Testing through Visualization to evaluate the match between use case, derived categories, and labels. We exemplify this approach by applying it to the 20 Newsgroups dataset and the MNIST dataset, both considered standard datasets in their respective domains. The results show that the labels of the 20 Newsgroups dataset are imprecise, which implies that a Machine Learning model cannot learn a meaningful abstraction of the derived categories and that no conclusions can be drawn from achieving high accuracy on this dataset. For the MNIST dataset, we demonstrate that the labels can be confirmed to be well defined. We conclude that even for datasets considered standard, quality and suitability have to be assessed in order to learn meaningful abstractions and thus improve trust in Machine Learning models.

Paper Structure (22 sections, 9 figures, 2 tables)

Introduction
The Importance of Dataset Assessment
Addressing Dataset Quality
Addressing Dataset Suitability
Standardness Does not Equal Quality
Researchers' Questionable Reliance on Standardness
20 Newsgroups Dataset
MNIST Dataset
Summary
A Quali-Quantitative Method for Dataset Interrogation
Grounded Theory
Hypothesis Testing through Visualization
Dataset Interrogation Exemplified
The 20 Newsgroups dataset
Grounded Theory.
...and 7 more sections

Figures (9)

Figure 1: We argue that a data scientist must actively ensure the match between the use case, labels, and categories (dashed lines). The use case provides the context for deriving categories which should match the labels (solid lines).
Figure 2: Document 51060 of the category alt.atheism. The text contains an article about atheism and therefore has a direct link to the category label.
Figure 3: Document 51194 of the class label alt.atheism. The text is part of a discussion about whether the growing number of people identifying as atheists is correlated to the growing number of depression cases.
Figure 4: The 20 Newsgroups dataset was reduced with t-SNE and LSI as proposed by Atzberger and Cech et al. The documents 51060, 51194, 52910, and 53449 are highlighted. They are widely spread in the visualization and are also semantically dissimilar.
Figure 5: The 20 Newsgroups dataset was reduced with t-SNE and LDA as proposed by Atzberger and Cech et al. The documents 51060, 51194, 52910, and 53449 are highlighted. They are widely spread in the visualization and are also semantically dissimilar.
...and 4 more figures
