Machine Learning in Epidemiology

Marvin N. Wright; Lukas Burk; Pegah Golchian; Jan Kapar; Niklas Koenen; Sophie Hanna Langbein

TL;DR

The chapter addresses the challenge of applying machine learning in epidemiology amid increasingly complex, high-dimensional health data. It presents a principled framework for supervised and unsupervised learning, model evaluation, hyperparameter optimization, and interpretable ML, illustrated with a heart-disease dataset and R/mlr3 workflows. Key contributions include practical guidance on tree-based methods, neural networks, resampling strategies, nested resampling for unbiased tuning, and both model-agnostic and model-specific interpretability techniques, complemented by a discussion of generative modeling and privacy considerations. The work emphasizes robust evaluation, transparent reporting, and responsible use of ML in epidemiology, ensuring predictive performance is balanced with calibration, fairness, and data quality.

Abstract

In the age of digital epidemiology, epidemiologists are faced with an increasing amount of data of growing complexity and dimensionality. Machine learning is a set of powerful tools that can help to analyze such enormous amounts of data. This chapter lays the methodological foundations for successfully applying machine learning in epidemiology. It covers the principles of supervised and unsupervised learning and discusses the most important machine learning methods. Strategies for model evaluation and hyperparameter optimization are developed, and interpretable machine learning is introduced. The theoretical parts are accompanied by code examples in R, which use an example dataset on heart disease throughout the chapter.

Paper Structure (41 sections, 27 equations, 16 figures, 1 table)

This paper contains 41 sections, 27 equations, 16 figures, 1 table.

Introduction
Supervised Learning
Tree-based Machine Learning Methods
Classification and Regression Trees
Bagging and Boosting
Data Example
Artificial Neural Networks
Architecture
Training of a Neural Network
Data Example
Model Evaluation and Resampling
Evaluation Metrics
Binary Classification Measures
Resampling and Generalization Performance
Cross-Validation
...and 26 more sections

Figures (16)

Figure 1: (a) Decision tree of the heart disease dataset using only the features ST_depression and serum_cholesterol. (b) The corresponding partition plot. The points are the instances/patients, and the shaded areas denote the model predictions.
Figure 2: A decision tree fitted on the heart disease dataset, visualized with the rpart.plot package. It predicts whether heart disease is present or absent in a patient, based on the features in the dataset, e.g., the results of a thallium stress test (thal). Each node contains the following information: 1) absent/present prediction, 2) proportion of patients with present heart disease, 3) size of the node as percentage of total sample size. The splitting criteria are denoted below the nodes. If the criteria apply, we follow the left path, otherwise the right path.
Figure 3: A visual representation of the cross-validation results of the cost-complexity pruning. The $x$-axis shows the complexity parameter cp. At the top, the number of leaf nodes corresponding to each complexity parameter is denoted. The $y$-axis represents a prediction error measure, based on cross-validation. The lowest error is reached at a cp value of 0.031. Figure created with the rpart package and slightly modified.
Figure 4: Model architecture of a sequential neural network with seven dense layers, generating predictions $\hat{y}$ from the input $\bm{x}$. On the right, the model's fourth layer is magnified, mapping the previous layer's output $\bm{x}^{(4)}$ to the next layer's input $\bm{x}^{(5)}$ through an affine transformation followed by a non-linear function, denoted as $\sigma_4$.
Figure 5: (a) The graphs of typical activation functions: hyperbolic tangent (green), rectified linear unit (ReLU) (orange), and logistic function (blue) used for probability outcomes. (b) Illustration of gradient descent for learning the optimum $\bm{\hat{\theta}}$: the current parameter $\bm{\theta}_t$ is iteratively updated by $-\eta \nabla J_\mathcal{D}(\bm{\theta}_t)$, a step determined by the negated slope of the tangent to the loss function at $\bm{\theta}_t$ and the learning rate $\eta$.
...and 11 more figures
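
The notation in the captions of Figures 4 and 5 can be written out as a short sketch. The weight matrix $\bm{W}_4$ and bias vector $\bm{b}_4$ are names assumed here for illustration and do not appear in the captions; the second expression is the standard gradient descent step, which is presumably what Figure 5(b) depicts.

$$
\bm{x}^{(5)} = \sigma_4\left(\bm{W}_4 \bm{x}^{(4)} + \bm{b}_4\right),
\qquad
\bm{\theta}_{t+1} = \bm{\theta}_t - \eta \nabla J_\mathcal{D}(\bm{\theta}_t)
$$

The first expression is the affine transformation followed by the non-linear function $\sigma_4$ that maps the fourth layer's input to the fifth layer's input. The second moves the current parameter against the gradient of the loss, with the learning rate $\eta$ scaling the step size.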
