An automated machine learning framework to optimize radiomics model construction validated on twelve clinical applications

Martijn P. A. Starmans, Sebastian R. van der Voort, Thomas Phil, Milea J. M. Timbergen, Melissa Vos, Guillaume A. Padmos, Wouter Kessels, David Hanff, Dirk J. Grunhagen, Cornelis Verhoef, Stefan Sleijfer, Martin J. van den Bent, Marion Smits, Roy S. Dwarkasing, Christopher J. Els, Federico Fiduzi, Geert J. L. H. van Leenders, Anela Blazevic, Johannes Hofland, Tessa Brabander, Renza A. H. van Gils, Gaston J. H. Franssen, Richard A. Feelders, Wouter W. de Herder, Florian E. Buisman, Francois E. J. A. Willemssen, Bas Groot Koerkamp, Lindsay Angus, Astrid A. M. van der Veldt, Ana Rajicic, Arlette E. Odink, Mitchell Deen, Jose M. Castillo T., Jifke Veenland, Ivo Schoots, Michel Renckens, Michail Doukas, Rob A. de Man, Jan N. M. IJzermans, Razvan L. Miclea, Peter B. Vermeulen, Esther E. Bron, Maarten G. Thomeer, Jacob J. Visser, Wiro J. Niessen, Stefan Klein

TL;DR

This study tackles the reproducibility and efficiency bottlenecks in radiomics by introducing WORC, an automated, modular AutoML framework that optimizes complete radiomics workflows per clinical application through a Combined Algorithm Selection and Hyperparameter optimization (CASH) formulation. It compares random search and Bayesian optimization (SMAC) with three ensembling strategies, showing that a medium-budget random search with simple averaging performs comparably to more complex methods while improving stability. Across twelve clinical applications, WORC outperforms a conventional radiomics baseline and often matches or exceeds human expert performance, demonstrating strong generalization and robustness on multi-center data. By releasing six public datasets (930 patients) and the WORC toolbox, the work advances reproducibility and provides a scalable path to automated, cross-application radiomics model construction.

Abstract

Predicting clinical outcomes from medical images using quantitative features ("radiomics") requires many method design choices. Currently, in new clinical applications, finding the optimal radiomics method out of the wide range of methods relies on a manual, heuristic trial-and-error process. We introduce a novel automated framework that optimizes radiomics workflow construction per application by standardizing the radiomics workflow in modular components, including a large collection of algorithms for each component, and formulating a combined algorithm selection and hyperparameter optimization problem. To solve it, we employ automated machine learning through two strategies (random search and Bayesian optimization) and three ensembling approaches. Results show that a medium-sized random search and straightforward ensembling perform similarly to more advanced methods while being more efficient. Validated across twelve clinical applications, our approach outperforms both a radiomics baseline and human experts. In conclusion, our framework improves and streamlines radiomics research by fully automatically optimizing radiomics workflow construction. To facilitate reproducibility, we publicly release six datasets, software of the method, and code to reproduce this study.
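
The combination of random search and ensembling described in the abstract can be illustrated with a minimal sketch. It is not the WORC implementation: the two toy "algorithms", their hyperparameter ranges, the accuracy metric, and all function names (`sample_workflow`, `random_search`, `ensemble_predict`) are assumptions made for illustration, and the toy workflows have no fitting step, so the sampled parameters are the model itself. The sketch jointly samples an algorithm and its hyperparameters, ranks the sampled workflows on validation data, and averages the predictions of the top-ranked ones.

```python
import random
from statistics import mean

# Hypothetical toy search space: each algorithm has its own hyperparameter sampler.
SEARCH_SPACE = {
    "threshold": lambda rng: {"feature": rng.randrange(2), "cut": rng.uniform(-1.0, 1.0)},
    "linear": lambda rng: {"w0": rng.uniform(-1.0, 1.0), "w1": rng.uniform(-1.0, 1.0)},
}


def make_toy_data(n_samples, seed=1):
    """Two random features per sample; the label is 1 when their sum is positive."""
    rng = random.Random(seed)
    data = []
    for _ in range(n_samples):
        x = (rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0))
        data.append((x, 1 if x[0] + x[1] > 0 else 0))
    return data


def sample_workflow(rng):
    """Jointly sample an algorithm and its hyperparameters (the CASH idea)."""
    algorithm = rng.choice(sorted(SEARCH_SPACE))
    return algorithm, SEARCH_SPACE[algorithm](rng)


def predict(workflow, x):
    """Return a hard 0/1 prediction of one workflow for one sample."""
    algorithm, params = workflow
    if algorithm == "threshold":
        return 1.0 if x[params["feature"]] > params["cut"] else 0.0
    score = params["w0"] * x[0] + params["w1"] * x[1]
    return 1.0 if score > 0 else 0.0


def accuracy(workflow, data):
    """Validation metric (an assumption; any scalar score could be used)."""
    return mean(1.0 if predict(workflow, x) == float(y) else 0.0 for x, y in data)


def random_search(validation_data, n_iterations, n_ensemble, seed=0):
    """Sample workflows at random and keep the n_ensemble best on validation data."""
    rng = random.Random(seed)
    candidates = [sample_workflow(rng) for _ in range(n_iterations)]
    ranked = sorted(candidates, key=lambda w: accuracy(w, validation_data), reverse=True)
    return ranked[:n_ensemble]


def ensemble_predict(ensemble, x):
    """Average the predictions of all workflows in the ensemble."""
    return mean(predict(workflow, x) for workflow in ensemble)
```

Calling `random_search(make_toy_data(200), n_iterations=200, n_ensemble=10)` returns the ten best sampled workflows, and `ensemble_predict` turns them into a single averaged score between 0 and 1. Setting `n_ensemble=1` corresponds to using only the single best workflow found on the validation data.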

Paper Structure

This paper contains 30 sections, 4 equations, 6 figures, 6 tables, 1 algorithm.

Figures (6)

  • Figure 1: Schematic overview of the workflow search space in our framework. The search space consists of various sequential sets of algorithms, where each algorithm may include various hyperparameters, as indicated by the leaves in the trees. An example of a workflow, i.e., a specific combination of algorithms and parameters, is indicated by the gray nodes. Abbreviations: AdaBoost: adaptive boosting; ADASYN: adaptive synthetic sampling; KNN: k-nearest neighbor; GLCM: gray level co-occurrence matrix; SMOTE: synthetic minority oversampling technique; SVM: support vector machine.
  • Figure 2: Error plots of the area under the receiver operating characteristic curve (AUC) on the test datasets of the radiomics models on six datasets (Lipo, Desmoid, Liver, GIST, CRLM, and Melanoma) for two optimization strategies (RS: random search with $N_{RS}=1000$, SMAC: sequential model-based algorithm configuration with different computational budgets (low, medium, high)) and a radiomics state-of-the-art (SOTA) baseline, when using either the single best found validation workflow (1) or one of three ensembling strategies (100: $N_{\text{ens}} = 100$, FN: FitNumber, FS: ForwardSelection). The error plots represent 95% confidence intervals, estimated through $k_{\text{test}}=20$ random-split cross-validation on the entire dataset. The circle represents the mean.
  • Figure 3: Error plots of the area under the receiver operating characteristic curve (AUC) of the radiomics models on twelve datasets. The error plots represent the 95% confidence intervals, estimated through $k_{\text{test}}=100$ random-split cross-validation on the entire dataset (all except Glioma) or through 1000x bootstrap resampling of the independent test set (Glioma). The circle represents the mean (all except Glioma) or point estimate (Glioma), which is also stated to the right of each circle. The dashed line corresponds to the AUC of random guessing (0.50).
  • Figure A.1: Cross-validation setups used by our WORC framework for optimization and evaluation. When a single dataset is used, internal validation is performed through a $k_{\text{test}}=100$ random-split cross-validation (A). When fixed, separate training and test datasets are used, external validation is performed by developing the model on the training set and evaluating the performance on the test set through 1000x bootstrap resampling (B). Both include an internal $k_{\text{training}}=5$ random-split cross-validation on the training set to split the training set into parts for actual training and validation, in which the model optimization is performed. The final selected model, trained on the full training dataset, is used for independent testing on the test dataset.
  • Figure A.2: Error plots of the weighted $F_{1,w}$ score on the validation datasets of the radiomics models on six datasets (Lipo, Desmoid, Liver, GIST, CRLM, and Melanoma) for two optimization strategies (RS: random search, SMAC: sequential model-based algorithm configuration) with different computational budgets (low, medium, high) and a radiomics state-of-the-art (SOTA) baseline, when using either the single best found validation workflow (1) or one of three ensembling strategies (100: $N_{\text{ens}} = 100$, FN: fit number, FS: forward selection). The error plots represent the 95% confidence intervals, estimated through $k_{\text{test}}=20$ random-split cross-validation on the validation dataset. The circle represents the mean.
  • ...and 1 more figure
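
The evaluation setups named in the captions of Figure 3 and Figure A.1 (an outer random-split cross-validation with an inner random-split cross-validation on each training set, and bootstrap resampling of a fixed test set) can be sketched as follows. This is only an illustration under stated assumptions: the 20% hold-out fraction, the absence of stratification, the percentile method for the interval, and the helper names `random_split`, `nested_random_splits`, and `bootstrap_ci` are not taken from the paper.

```python
import random


def random_split(indices, holdout_fraction, rng):
    """Shuffle the indices and split them into (kept, held-out) parts."""
    shuffled = list(indices)
    rng.shuffle(shuffled)
    n_holdout = max(1, round(len(shuffled) * holdout_fraction))
    return shuffled[n_holdout:], shuffled[:n_holdout]


def nested_random_splits(n_samples, k_test=100, k_training=5,
                         holdout_fraction=0.2, seed=0):
    """Yield (train, test, inner_splits) for each outer random split.

    inner_splits holds k_training (train, validation) splits of the outer
    training set, on which model optimization would be performed.
    The holdout_fraction of 0.2 is an assumption, not a value from the paper.
    """
    rng = random.Random(seed)
    for _ in range(k_test):
        train, test = random_split(range(n_samples), holdout_fraction, rng)
        inner_splits = [random_split(train, holdout_fraction, rng)
                        for _ in range(k_training)]
        yield train, test, inner_splits


def bootstrap_ci(values, statistic, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval of a statistic (assumed method)."""
    rng = random.Random(seed)
    stats = sorted(
        statistic([rng.choice(values) for _ in values])
        for _ in range(n_resamples)
    )
    lower = stats[int((alpha / 2) * n_resamples)]
    upper = stats[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lower, upper
```

In setup (A), each outer split would supply one test-set score and the spread over the `k_test` scores would be summarized. In setup (B), `bootstrap_ci` would be applied to the per-patient results on the fixed test set, for example with a function that computes the AUC as `statistic`.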