
How to avoid machine learning pitfalls: a guide for academic researchers

Michael A. Lones

TL;DR

The paper addresses widespread ML pitfalls in academic research, from data handling to reporting, and offers practical guardrails. It presents a structured, annually updated guide of Dos and Don'ts spanning data use, model building, evaluation, fair comparison, and reporting. Key contributions include concrete practices such as independent test sets, nested cross-validation for hyperparameter tuning, meaningful baselines, multi-metric reporting, and fairness checks, with an emphasis on transparency and reproducibility. The guidance aims to improve the robustness, trustworthiness, and real-world impact of ML research by making methodological rigour feasible for researchers.
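The nested cross-validation recommended above can be illustrated with a small, self-contained sketch. It is not code from the guide: the k-nearest-neighbour classifier, the candidate values of k, the fold counts and the synthetic two-cluster data are all hypothetical choices made for illustration. What matters is the structure: the inner loop chooses k using only each outer training split, so each outer held-out fold acts as an independent test set that is used once, after tuning is finished.

```python
import random
from collections import Counter


def knn_predict(train, x, k):
    """Majority vote among the k training points closest to x (squared Euclidean distance)."""
    nearest = sorted(train, key=lambda p: sum((a - b) ** 2 for a, b in zip(p[0], x)))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def accuracy(train, test, k):
    """Fraction of test points whose label k-NN (using train as reference) predicts correctly."""
    return sum(knn_predict(train, x, k) == y for x, y in test) / len(test)


def k_folds(data, n_folds):
    """Yield (train, held_out) pairs; every point is held out exactly once."""
    folds = [data[i::n_folds] for i in range(n_folds)]
    for i in range(n_folds):
        train = [p for j, fold in enumerate(folds) if j != i for p in fold]
        yield train, folds[i]


def inner_cv_accuracy(train, k, n_folds):
    """Mean validation accuracy of a given k across inner folds drawn from train only."""
    return sum(accuracy(tr, va, k) for tr, va in k_folds(train, n_folds)) / n_folds


def nested_cv(data, candidate_ks, outer_folds=5, inner_folds=3):
    """Outer loop estimates generalisation; inner loop tunes k on the outer training data only."""
    outer_scores = []
    for outer_train, outer_test in k_folds(data, outer_folds):
        # Hyperparameter selection never sees outer_test.
        best_k = max(candidate_ks, key=lambda k: inner_cv_accuracy(outer_train, k, inner_folds))
        # The outer test fold is touched once, after k has been fixed.
        outer_scores.append(accuracy(outer_train, outer_test, best_k))
    return sum(outer_scores) / len(outer_scores)


# Hypothetical toy data: two overlapping 2-D Gaussian clusters, labelled 0 and 1.
rng = random.Random(0)
data = [((rng.gauss(0, 1), rng.gauss(0, 1)), 0) for _ in range(30)]
data += [((rng.gauss(2, 1), rng.gauss(2, 1)), 1) for _ in range(30)]
rng.shuffle(data)

score = nested_cv(data, candidate_ks=[1, 3, 5])
print(f"nested CV accuracy estimate: {score:.2f}")
```

In practice a library implementation would normally be used (for example, scikit-learn's `GridSearchCV` passed to `cross_val_score`); the pure-Python version simply makes the two loops explicit.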

Abstract

Mistakes in machine learning practice are commonplace, and can result in a loss of confidence in the findings and products of machine learning. This guide outlines common mistakes that occur when using machine learning, and what can be done to avoid them. Whilst it should be accessible to anyone with a basic understanding of machine learning techniques, it focuses on issues that are of particular concern within academic research, such as the need to do rigorous comparisons and reach valid conclusions. It covers five stages of the machine learning process: what to do before model building, how to reliably build models, how to robustly evaluate models, how to compare models fairly, and how to report results.

Paper Structure

This paper contains 43 sections and 8 figures.

Figures (8)

  • Figure 1: See the "leakage" section. [left] How things should be, with the training set used to train the model, and the test set used to measure its generality. [right] When there's a data leak, the test set can implicitly become part of the training process, meaning that it no longer provides a reliable measure of generality.
  • Figure 2: See the "trends" section. A rough history of neural networks and deep learning, showing what I consider to be the milestones in their development. For a far more thorough account of the field's historical development, take a look at Schmidhuber (2015).
  • Figure 3: See the "feature" section. [top] Data leakage due to carrying out feature selection before splitting off the test data (outlined in red), causing the test set to become an implicit part of model training. [centre] How it should be done. [bottom] When using cross-validation, it's important to carry out feature selection independently for each iteration, based only on the subset of data (shown in blue) used for training during that iteration.
  • Figure 4: See the "spurious" section. The problem of spurious correlations in images, as illustrated by the tank problem. The images on the left are tanks, and those on the right are not tanks. However, the consistent background (blue for tanks, grey for others) means that these images can be classified by merely looking at the colours of pixels towards the top of the images, rather than having to recognise the objects in the images, resulting in a poor model.
  • Figure 5: See the "validation" section. [top] Using the test set repeatedly during model selection results in the test set becoming an implicit part of the training process. [bottom] A validation set should be used instead during model selection, and the test set should only be used once to measure the generality of the final model.
  • Figures 6 to 8 are not listed here.
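The difference between the top and bottom panels of Figure 3 can be demonstrated with a minimal sketch. The setup is a hypothetical toy, not taken from the guide: a simple mean-difference feature ranking, a nearest-centroid classifier, and 500 features of pure noise with random, balanced labels. Because nothing in the data predicts the labels, an honest estimate should sit near 50% accuracy; selecting features on all rows before cross-validation typically reports a much higher figure, which is purely an artefact of leakage.

```python
import random


def top_features(rows, labels, m):
    """Return indices of the m features with the largest |mean(class 1) - mean(class 0)|."""
    def gap(j):
        ones = [r[j] for r, lab in zip(rows, labels) if lab == 1]
        zeros = [r[j] for r, lab in zip(rows, labels) if lab == 0]
        return abs(sum(ones) / len(ones) - sum(zeros) / len(zeros))
    return sorted(range(len(rows[0])), key=gap, reverse=True)[:m]


def centroid_accuracy(train_x, train_y, test_x, test_y, feats):
    """Fit a nearest-centroid classifier on the chosen features; return test accuracy."""
    centroids = {}
    for c in (0, 1):
        members = [x for x, lab in zip(train_x, train_y) if lab == c]
        centroids[c] = [sum(x[j] for x in members) / len(members) for j in feats]

    def predict(x):
        return min((0, 1), key=lambda c: sum((x[j] - cj) ** 2 for j, cj in zip(feats, centroids[c])))

    return sum(predict(x) == lab for x, lab in zip(test_x, test_y)) / len(test_y)


def cross_validate(X, y, m, select_inside_folds, n_folds=5):
    """k-fold CV; feature selection is done per fold (correct) or once on all rows (leaky)."""
    idx = list(range(len(X)))
    folds = [idx[i::n_folds] for i in range(n_folds)]
    # Leaky variant: selection sees every row, including rows that later serve as test data.
    leaky_feats = None if select_inside_folds else top_features(X, y, m)
    scores = []
    for held in folds:
        held_set = set(held)
        train = [i for i in idx if i not in held_set]
        tr_x, tr_y = [X[i] for i in train], [y[i] for i in train]
        # Correct variant: selection uses only this iteration's training rows.
        feats = top_features(tr_x, tr_y, m) if select_inside_folds else leaky_feats
        scores.append(centroid_accuracy(tr_x, tr_y, [X[i] for i in held], [y[i] for i in held], feats))
    return sum(scores) / n_folds


# Hypothetical data: 40 samples of 500 pure-noise features with random balanced labels,
# so no method can genuinely do better than chance.
rng = random.Random(42)
X = [[rng.gauss(0, 1) for _ in range(500)] for _ in range(40)]
y = [0] * 20 + [1] * 20
rng.shuffle(y)

honest_acc = cross_validate(X, y, m=10, select_inside_folds=True)
leaky_acc = cross_validate(X, y, m=10, select_inside_folds=False)
print(f"selection inside folds: {honest_acc:.2f}, selection before splitting: {leaky_acc:.2f}")
```

The same principle applies to any data-dependent preprocessing step, such as scaling or imputation: fit it on each iteration's training data only, never on the full dataset.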