A Guide to Failure in Machine Learning: Reliability and Robustness from Foundations to Practice

Eric Heim, Oren Wright, David Shriver

TL;DR

This paper presents a pragmatic guide to understanding and mitigating ML failure by separating failure into reliability (in-distribution performance) and robustness (performance under distribution shift). It links formal generalization and risk concepts to engineering practice, detailing data-collection strategies, test-case-based evaluation, and self-assessment/monitoring approaches, including calibration and probabilistic uncertainty. The work surveys techniques for handling reliability (data strategies, test design, monitoring) and robustness (uncertainty types, perturbations, and domain shifts), and discusses practical implications such as MLOps workflows and system-level considerations. The key contribution is a structured framework that helps practitioners reason about where failures arise and how to design data, tests, and monitoring so that deployment stays dependable in real-world, changing environments.

Abstract

One of the main barriers to adoption of Machine Learning (ML) is that ML models can fail unexpectedly. In this work, we aim to provide practitioners with a guide to better understand why ML models fail and to equip them with techniques they can use to reason about failure. Specifically, we discuss failure as being caused by either a lack of reliability or a lack of robustness. Differentiating the causes of failure in this way allows us to formally define why models fail from first principles and to tie these definitions to engineering concepts and real-world deployment settings. Throughout the document we provide 1) a summary of important theoretical concepts in reliability and robustness, 2) a sampling of current techniques that practitioners can use to reason about ML model reliability and robustness, and 3) examples that show how these concepts and techniques apply to real-world settings.

Paper Structure

This paper contains 33 sections, 21 equations.