Position: Why We Must Rethink Empirical Research in Machine Learning

Moritz Herrmann; F. Julian D. Lange; Katharina Eggensperger; Giuseppe Casalicchio; Marcel Wever; Matthias Feurer; David Rügamer; Eyke Hüllermeier; Anne-Laure Boulesteix; Bernd Bischl

TL;DR

This paper argues that empirical ML is prone to non-replicable and overly optimistic conclusions due to biased study designs and an overemphasis on confirmatory framing. It advocates a plural, continuum-based approach that blends exploratory and confirmatory research, distinguishing insight-oriented studies from method-developing work, and calling for neutral comparisons and replication efforts. Practical recommendations include broader infrastructure for benchmarking, open datasets, preregistration, replication/meta-studies, and education across disciplines to improve empirical rigor. By treating ML as a maturing empirical science, the work aims to enhance reliability, mitigate questionable practices, and curb misinterpretation of statistical results, thereby supporting more robust and trustworthy progress in the field.

Abstract

We warn against a common but incomplete understanding of empirical research in machine learning that leads to non-replicable results, makes findings unreliable, and threatens to undermine progress in the field. To overcome this alarming situation, we call for greater awareness of the plurality of ways to gain knowledge experimentally, as well as of certain epistemic limitations. In particular, we argue that most current empirical machine learning research is framed as confirmatory research, although it should rather be considered exploratory.