Stop Preaching and Start Practising Data Frugality for Responsible Development of AI

Sophia N. Wilson, Guðrún Fjóla Guðmundsdóttir, Andrew Millard, Raghavendra Selvan, Sebastian Mair

TL;DR

This position paper argues that the machine learning community must move from preaching to practising data frugality for responsible artificial intelligence (AI) development, and presents empirical evidence that data frugality is both practical and beneficial.

Abstract

This position paper argues that the machine learning community must move from preaching to practising data frugality for responsible artificial intelligence (AI) development. Progress has long been equated with ever-larger datasets, which drove remarkable advances but now yield diminishing performance gains alongside rising energy use and carbon emissions. While awareness of data-frugal approaches has grown, their adoption has remained rhetorical, and data scaling continues to dominate development practice. We argue that this gap between preaching and practice must be closed, as continued data scaling entails substantial and under-accounted environmental impacts. To ground our position, we provide indicative estimates of the energy use and carbon emissions associated with the downstream use of ImageNet-1K. We then present empirical evidence that data frugality is both practical and beneficial, demonstrating that coreset-based subset selection can substantially reduce training energy consumption with little loss in accuracy, while also mitigating dataset bias. Finally, we outline actionable recommendations for moving data frugality from rhetoric to concrete practice for responsible development of AI.

Paper Structure (19 sections, 2 equations, 9 figures, 3 tables)

Figures (9)

  • Figure 1: Illustration inspired by a remark by prominent ML researcher Andrew Ng comparing an AI model to a rocket ship [kelly2016inevitable]. However, progress in rocket science depends less on adding more fuel and more on using it wisely, reflecting the principle of data frugality. Created with Nano Banana (Google, 2026).
  • Figure 2: Visualisation of a data lifecycle (purple) and a model lifecycle (blue). Subset selection methods can have positive effects on multiple parts across both lifecycles (grey).
  • Figure 3: Top-1 accuracy when pruning ImageNet-1K using Dyn-Unc (left), D2 (right), and InfoMax (right). Forgetting, Random, and the performance on the full ImageNet-1K are shown as references. The numbers are taken from [he2024large] and [tan2025data]. Note that the architectures differ, that there is no performance loss when pruning 25%-30% of the data with Dyn-Unc/InfoMax, and that the authors do not compare their methods against each other.
  • Figure 4: Performance of a classifier trained on the biased Colour-MNIST dataset (99% majority group), reported for the cases with no bias (aligned) and with bias (conflicting). Three coreset methods are shown: random (baseline), reweighted, and balanced.
  • Figure 5: Left: Observed and projected fraction of papers at ICLR with ImageNet use. Right: Observed and projected growth of papers using ImageNet to train models from random initialization.
  • ...and 4 more figures
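
The Figure 4 caption names three subset-selection variants: random, reweighted, and balanced. The following is a minimal sketch of what such variants could look like on a dataset with a 99% majority group. It is not the authors' implementation: the function names (`random_subset`, `group_weights`, `balanced_subset`), the use of a single group label per sample, and the exact weighting rule are assumptions made for illustration, and the paper's definitions may differ.

```python
import random
from collections import Counter, defaultdict


def random_subset(n, budget, rng):
    """Baseline: draw `budget` distinct indices uniformly from range(n)."""
    return rng.sample(range(n), budget)


def group_weights(groups, idx):
    """Assumed 'reweighted' variant: per-sample weights for a subset such that
    every group present in the subset carries the same total weight."""
    counts = Counter(groups[i] for i in idx)
    k = len(counts)  # number of groups present in the subset
    return [1.0 / (k * counts[groups[i]]) for i in idx]


def balanced_subset(groups, budget, rng):
    """Assumed 'balanced' variant: draw the same number of samples from each
    group, capped by the size of the group."""
    members = defaultdict(list)
    for i, g in enumerate(groups):
        members[g].append(i)
    per_group = budget // len(members)
    picked = []
    for g in sorted(members):
        take = min(per_group, len(members[g]))
        picked.extend(rng.sample(members[g], take))
    return picked


if __name__ == "__main__":
    rng = random.Random(0)
    # Toy group labels: 99% majority group (0), 1% minority group (1).
    groups = [0] * 9900 + [1] * 100

    idx_random = random_subset(len(groups), 100, rng)
    weights = group_weights(groups, idx_random)
    idx_balanced = balanced_subset(groups, 100, rng)

    print("random subset groups:  ", Counter(groups[i] for i in idx_random))
    print("total weight (random): ", round(sum(weights), 6))
    print("balanced subset groups:", Counter(groups[i] for i in idx_balanced))
```

In this toy setting, a uniform random subset mostly mirrors the 99% majority group, reweighting compensates through the loss weights, and the balanced variant changes which samples are kept. Which of these helps most on Colour-MNIST is what Figure 4 reports; the sketch makes no claim about that.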