
PePR: Performance Per Resource Unit as a Metric to Promote Small-Scale Deep Learning in Medical Image Analysis

Raghavendra Selvan, Bob Pepin, Christian Igel, Gabrielle Samuel, Erik B Dam

TL;DR

This work addresses the growing environmental and equity concerns of resource-intensive deep learning in medical image analysis by introducing PePR, a composite metric that computes $P_{\text{ePR}}(R,P)=\frac{P}{1+R}$ to balance performance $P$ with normalized resource cost $R$. The authors evaluate 131 pretrained architectures across three datasets, demonstrating that small-scale, pretrained models often yield better performance-per-resource trade-offs in resource-constrained settings. They also formalize PePR curves and plan for variants capturing different costs (e.g., energy, memory, carbon), showing that PePR can guide resource-aware model selection and promote AI equity in healthcare. The findings support prioritizing small-scale, well-pretrained architectures to reduce compute, data, and energy requirements while maintaining useful predictive performance.
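As a minimal sketch of the score defined above, the snippet below computes $P_{\text{ePR}}(R,P)=\frac{P}{1+R}$ for a handful of models. The min-max scaling used to obtain the normalized resource cost $R\in[0,1]$ is an assumption made here for illustration, and the model names, accuracies, and energy values are hypothetical and not taken from the paper.

```python
from typing import Sequence


def normalize_resource(costs: Sequence[float]) -> list[float]:
    """Min-max scale raw resource costs to [0, 1].

    Assumption: the summary only says R is "normalized"; min-max scaling
    across the compared models is one plausible choice.
    """
    lo, hi = min(costs), max(costs)
    if hi == lo:
        # All models cost the same, so no model is penalized.
        return [0.0 for _ in costs]
    return [(c - lo) / (hi - lo) for c in costs]


def pepr(performance: float, resource: float) -> float:
    """PePR score P / (1 + R) for performance P and normalized cost R."""
    if not 0.0 <= resource <= 1.0:
        raise ValueError("resource cost must be normalized to [0, 1]")
    return performance / (1.0 + resource)


# Hypothetical (test accuracy, training energy) pairs, for illustration only.
models = {
    "small": (0.80, 1.0),
    "medium": (0.85, 5.5),
    "large": (0.86, 10.0),
}

names = list(models)
norm_costs = normalize_resource([models[n][1] for n in names])
scores = {n: pepr(models[n][0], r) for n, r in zip(names, norm_costs)}
best = max(scores, key=scores.get)

for name in names:
    print(f"{name}: PePR = {scores[name]:.3f}")
print("best trade-off:", best)
```

With these made-up numbers the cheapest model keeps its full performance as its score ($R=0$), while the most expensive model has its performance halved ($R=1$), so a small gain in accuracy does not compensate for a large increase in resource use.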

Abstract

The recent advances in deep learning (DL) have been accelerated by access to large-scale data and compute. These large-scale resources have been used to train progressively larger models which are resource intensive in terms of compute, data, energy, and carbon emissions. These costs are becoming a new type of entry barrier to researchers and practitioners with limited access to resources at such scale, particularly in the Global South. In this work, we take a comprehensive look at the landscape of existing DL models for medical image analysis tasks and demonstrate their usefulness in settings where resources are limited. To account for the resource consumption of DL models, we introduce a novel measure to estimate the performance per resource unit, which we call the PePR score. Using a diverse family of 131 unique DL architectures (spanning 1M to 130M trainable parameters) and three medical image datasets, we capture trends about the performance-resource trade-offs. In applications like medical image analysis, we argue that small-scale, specialized models are better than striving for large-scale models. Furthermore, we show that using existing pretrained models that are fine-tuned on new data can significantly reduce the computational resources and data required compared to training models from scratch. We hope this work will encourage the community to focus on improving AI equity by developing methods and models with smaller resource footprints.

Paper Structure (13 sections, 4 equations, 6 figures, 2 tables)


Figures (6)

  • Figure 1: Number of publications per capita in different regions of the world for 2013 and 2022 on the topic broadly seen as "Artificial Intelligence". A large gap continues to persist in regions from the Global South compared to other well-performing regions, primarily in the Global North. Data source: OECD.ai
  • Figure 2: (a) Idealized PePR-E profile. (b) Performance curve for ESE-VoVNet$^*$ (lee2019energy). The orange point marks $P_{\text{ePRc}}^*$, beyond which the performance curve enters the region of diminishing returns. (c) Number of trainable parameters and energy consumption for the $131$ models, demonstrating a large variability in model scale. The vertical red line demarcates the median point for number of trainable parameters.
  • Figure 3: (a) Violin plot showing the influence of fine-tuning the pretrained models for ten epochs versus training the models from scratch for ten epochs, for all $131$ models. (b) Violin plot showing the influence on test performance of fine-tuning all models on $100\%$ and $10\%$ of the training data, across all three datasets. (c) Test performance $P\in [0,1]$ averaged over the three datasets for each of the $131$ models, fine-tuned for ten epochs, against the number of trainable parameters on a $\log_{10}$ scale. (d) PePR-E score for the $131$ models averaged over the three datasets.
  • Figure 4: (a) Test accuracy against normalized energy used for training on the Derma dataset. Points correspond to combinations of model and training epoch. Orange points lie on the Pareto frontier. The background is shaded according to the PePR-E score. (b) Validation performance and the corresponding PePR-M scores for all models trained until convergence on the ImageNet dataset, using the publicly available data from (rw2019timm). The PePR score shows that smaller models achieve a better performance-resource trade-off.
  • Figure A.1: Median PePR-E score for small models ($\leq$ 24.6M parameters) and large models ($>$ 24.6M parameters). All differences are significant ($p < 0.05$).
  • ...and 1 more figures
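
The caption of Figure 4(a) refers to points on the Pareto frontier of test accuracy against normalized training energy. As a rough sketch of how such a frontier could be identified, the snippet below keeps every point for which no other point is at least as cheap and at least as accurate. The function name and the sample points are hypothetical; this is not the authors' implementation.

```python
def pareto_frontier(points: list[tuple[float, float]]) -> list[int]:
    """Return indices of non-dominated (cost, performance) points.

    Lower cost and higher performance are preferred. Points are scanned in
    order of increasing cost (ties broken by higher performance first), and
    a point is kept only if it beats the best performance seen so far.
    """
    order = sorted(range(len(points)), key=lambda i: (points[i][0], -points[i][1]))
    frontier = []
    best_perf = float("-inf")
    for i in order:
        _, perf = points[i]
        if perf > best_perf:
            frontier.append(i)
            best_perf = perf
    return frontier


# Hypothetical (normalized energy, test accuracy) pairs, for illustration only.
points = [(0.1, 0.70), (0.2, 0.68), (0.4, 0.80), (0.4, 0.75), (0.9, 0.81)]
frontier = pareto_frontier(points)
print("Pareto-optimal indices:", frontier)
```

In this made-up example the second and fourth points are dropped because another point reaches higher accuracy at equal or lower energy.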