Evaluating Double Descent in Machine Learning: Insights from Tree-Based Models Applied to a Genomic Prediction Task

Guillermo Comesaña Cimadevila

TL;DR

This paper investigates double descent in classical tree-based models applied to predicting isoniazid resistance in Mycobacterium tuberculosis from whole-genome sequencing data. The authors vary model complexity along two axes, base-learner capacity ($P^{\text{leaf}}$, $P^{\text{boost}}$) and ensemble size ($P^{\text{ens}}$), and evaluate decision trees, random forests, and gradient boosting on CRyPTIC data and a Friedman #1 synthetic benchmark, using mean squared error as the metric. They find that a double-descent curve emerges only under composite scaling, while axis-specific scaling restores conventional bias–variance dynamics; this supports the unfolding hypothesis, under which double descent arises when distinct generalisation regimes are projected onto a single axis. The results also indicate that ensemble size has a stabilising, regularising effect, and they argue for analysing complexity along multiple dimensions when tuning hyperparameters. The work provides a practical framework and publicly available code to study generalisation dynamics in high-dimensional, real-world genomic data.

Abstract

Classical learning theory describes a well-characterised U-shaped relationship between model complexity and prediction error, reflecting a transition from underfitting in underparameterised regimes to overfitting as complexity grows. Recent work, however, has introduced the notion of a second descent in test error beyond the interpolation threshold, giving rise to the so-called double descent phenomenon. While double descent has been studied extensively in the context of deep learning, it has also been reported in simpler models, including decision trees and gradient boosting. In this work, we revisit these claims through the lens of classical machine learning applied to a biological classification task: predicting isoniazid resistance in Mycobacterium tuberculosis using whole-genome sequencing data. We systematically vary model complexity along two orthogonal axes, learner capacity (e.g., $P^{\text{leaf}}$, $P^{\text{boost}}$) and ensemble size (i.e., $P^{\text{ens}}$), and show that double descent consistently emerges only when complexity is scaled jointly across these axes. When either axis is held fixed, generalisation behaviour reverts to classical U- or L-shaped patterns. These results are replicated on a synthetic benchmark and support the unfolding hypothesis, which attributes double descent to the projection of distinct generalisation regimes onto a single complexity axis. Our findings underscore the importance of treating model complexity as a multidimensional construct when analysing generalisation behaviour. All code and reproducibility materials are available at: https://github.com/guillermocomesanacimadevila/Demystifying-Double-Descent-in-ML.
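
The difference between composite and axis-specific scaling can be sketched in a few lines of scikit-learn on a Friedman #1 dataset. This is a minimal illustration and not the authors' pipeline: the sample size, noise level, grids (`LEAF_GRID`, `ENS_GRID`), the `max_features` setting, and the helper names `tree_mse` and `forest_mse` are all assumptions chosen for the example. Leaf count is used as a stand-in for $P^{\text{leaf}}$ and the number of trees for $P^{\text{ens}}$.

```python
import numpy as np
from sklearn.datasets import make_friedman1
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data (Friedman #1); sizes and noise are illustrative choices.
X, y = make_friedman1(n_samples=600, n_features=10, noise=1.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

LEAF_GRID = [2, 5, 10, 20, 50, 100, 200]  # hypothetical P^leaf values
ENS_GRID = [1, 2, 5, 10, 20, 50]          # hypothetical P^ens values


def tree_mse(p_leaf):
    """Test MSE of a single decision tree capped at p_leaf leaves."""
    tree = DecisionTreeRegressor(max_leaf_nodes=p_leaf, random_state=0)
    tree.fit(X_tr, y_tr)
    return mean_squared_error(y_te, tree.predict(X_te))


def forest_mse(p_leaf, p_ens):
    """Test MSE of a random forest with p_ens trees, each capped at p_leaf leaves."""
    forest = RandomForestRegressor(
        n_estimators=p_ens,
        max_leaf_nodes=p_leaf,
        max_features=0.5,  # assumed feature subsampling rate
        random_state=0,
    )
    forest.fit(X_tr, y_tr)
    return mean_squared_error(y_te, forest.predict(X_te))


# Composite axis: grow one tree first, then add trees at the largest leaf budget.
composite = [tree_mse(p) for p in LEAF_GRID] + [
    forest_mse(LEAF_GRID[-1], m) for m in ENS_GRID
]

# Axis-specific curves: vary one quantity while holding the other fixed.
leaf_axis = [forest_mse(p, 10) for p in LEAF_GRID]            # fixed P^ens = 10
ens_axis = [forest_mse(LEAF_GRID[-1], m) for m in ENS_GRID]   # fixed P^leaf = 200

print("composite:", np.round(composite, 3))
print("leaf axis:", np.round(leaf_axis, 3))
print("ens axis: ", np.round(ens_axis, 3))
```

Plotting `composite` against its index puts both kinds of complexity on one horizontal axis, which is the setting in which the abstract reports double descent; `leaf_axis` and `ens_axis` are the single-axis views it contrasts against. The shapes obtained from this toy setup depend on the assumed grids and noise level and need not match the paper's figures.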

Paper Structure

This paper contains 4 sections, 1 equation, 7 figures.

Figures (7)

  • Figure 1: Double descent illustration emerging from two complexity axes. Left: error varies across two model complexity dimensions, forming a U-curve (blue) along one axis and an L-curve (red) along the other. Right: collapsing these dimensions produces the double descent curve, suggesting it may arise from merging distinct generalisation behaviours. Figure adapted from Curth et al. (2023).
  • Figure 2: Methodological pipeline. The blue dashed line indicates the branch where synthetic data experiments were conducted, following the same structure as the pipeline used for CRyPTIC data.
  • Figure 3: Composite complexity in decision trees and random forests on the CRyPTIC dataset. $\mathrm{MSE}$ is plotted against model complexity for $P^{\text{leaf}}\in\{50,100,200,500\}$. Within each subplot, complexity increases first by growing single-tree capacity ($L_2$ to $L_{\max}$), then by increasing $P^{\text{ens}}$ (RF1 to RF50). The vertical dotted line marks the interpolation threshold.
  • Figure 4: Test $\mathrm{MSE}$ for tree-based models on the synthetic dataset. Left: composite complexity (increasing $P^{\text{leaf}}$ then $P^{\text{ens}}$). Middle: $\mathrm{MSE}$ vs. $P^{\text{leaf}}$ at fixed $P^{\text{ens}}$. Right: $\mathrm{MSE}$ vs. $P^{\text{ens}}$ at fixed $P^{\text{leaf}}$.
  • Figure 5: Gradient boosting on the CRyPTIC dataset. (A) Composite complexity: increasing $P^{\text{boost}}$ then $P^{\text{ens}}$. (B) $\mathrm{MSE}$ vs. $P^{\text{boost}}$ at fixed $P^{\text{ens}}$. (C) $\mathrm{MSE}$ vs. $P^{\text{ens}}$ at fixed $P^{\text{boost}}$.
  • ...and 2 more figures
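
The three panels described for gradient boosting in Figure 5 can be mimicked with a rough sketch. It rests on assumptions not stated in the captions: $P^{\text{boost}}$ is taken to be the number of boosting rounds, $P^{\text{ens}}$ the number of independently seeded boosted models whose predictions are averaged, and Friedman #1 data is used in place of CRyPTIC. The helper `boosted_ensemble_mse`, the grids, and the learning rate, depth, and subsample values are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_friedman1
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

# Stand-in data: Friedman #1 instead of the CRyPTIC genomes (illustrative only).
X, y = make_friedman1(n_samples=600, n_features=10, noise=1.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

BOOST_GRID = [1, 5, 20, 50, 100]  # hypothetical P^boost values (boosting rounds)
ENS_GRID = [1, 2, 5, 10]          # hypothetical P^ens values (models averaged)


def boosted_ensemble_mse(p_boost, p_ens):
    """Test MSE of the average of p_ens boosted models with p_boost rounds each."""
    predictions = []
    for seed in range(p_ens):
        model = GradientBoostingRegressor(
            n_estimators=p_boost,
            learning_rate=0.5,  # assumed; large enough to overfit quickly
            max_depth=3,
            subsample=0.7,      # row subsampling so that seeds give different models
            random_state=seed,
        )
        model.fit(X_tr, y_tr)
        predictions.append(model.predict(X_te))
    averaged = np.mean(predictions, axis=0)
    return float(np.mean((y_te - averaged) ** 2))


# (A) Composite: add boosting rounds to one model, then average more models.
panel_a = [boosted_ensemble_mse(b, 1) for b in BOOST_GRID] + [
    boosted_ensemble_mse(BOOST_GRID[-1], m) for m in ENS_GRID
]
# (B) MSE vs. P^boost at fixed P^ens = 5.
panel_b = [boosted_ensemble_mse(b, 5) for b in BOOST_GRID]
# (C) MSE vs. P^ens at fixed P^boost = 100.
panel_c = [boosted_ensemble_mse(BOOST_GRID[-1], m) for m in ENS_GRID]

print("A:", np.round(panel_a, 3))
print("B:", np.round(panel_b, 3))
print("C:", np.round(panel_c, 3))
```

Averaging over seeds is only one way to define an ensemble of boosted models; it is used here because it keeps the two axes independent, so each can be held fixed while the other varies.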