Machine Learning in Epidemiology
Marvin N. Wright, Lukas Burk, Pegah Golchian, Jan Kapar, Niklas Koenen, Sophie Hanna Langbein
TL;DR
The chapter addresses the challenge of applying machine learning in epidemiology amid increasingly complex, high-dimensional health data. It presents a principled framework for supervised and unsupervised learning, model evaluation, hyperparameter optimization, and interpretable ML, illustrated with a heart-disease dataset and R/mlr3 workflows. Key contributions include practical guidance on tree-based methods, neural networks, resampling strategies, nested resampling for unbiased tuning, and both model-agnostic and model-specific interpretability techniques, complemented by a discussion of generative modeling and privacy considerations. The work emphasizes robust evaluation, transparent reporting, and responsible use of ML in epidemiology, so that predictive performance is weighed against calibration, fairness, and data quality.
Abstract
In the age of digital epidemiology, epidemiologists are faced with an increasing amount of data of growing complexity and dimensionality. Machine learning offers a set of powerful tools that can help to analyze such large amounts of data. This chapter lays the methodological foundations for successfully applying machine learning in epidemiology. It covers the principles of supervised and unsupervised learning and discusses the most important machine learning methods. Strategies for model evaluation and hyperparameter optimization are developed, and interpretable machine learning is introduced. The theoretical parts are accompanied by code examples in R, which use an example dataset on heart disease throughout the chapter.
