A robust methodology for long-term sustainability evaluation of Machine Learning models

Jorge Paz-Ruza, João Gama, Amparo Alonso-Betanzos, Bertha Guijarro-Berdiñas

TL;DR

This work addresses the inadequacy of short-term, batch-centric sustainability assessments for ML systems by proposing a model-agnostic, long-term evaluation protocol that accommodates both batch and streaming learning. It assesses performance, sustainability, and data availability along a lifecycle trajectory with sequential data and prequential evaluation. Empirical results across multiple datasets reveal that long-term environmental cost can be large with marginal performance gains, and streaming approaches can rival batch methods on simpler tasks. The proposed protocol offers a practical framework for regulators and practitioners to evaluate ML sustainability in real-world, evolving usage scenarios, with broad implications for deploying energy-efficient AI.

Abstract

Sustainability and efficiency have become essential considerations in the development and deployment of Artificial Intelligence systems, yet existing regulatory and reporting practices lack standardized, model-agnostic evaluation protocols. Current assessments often measure only short-term experimental resource usage and disproportionately emphasize batch learning settings, failing to reflect real-world, long-term AI lifecycles. In this work, we propose a comprehensive evaluation protocol for assessing the long-term sustainability of ML models, applicable to both batch and streaming learning scenarios. Through experiments on diverse classification tasks using a range of model types, we demonstrate that traditional static train-test evaluations do not reliably capture sustainability under evolving data and repeated model updates. Our results show that long-term sustainability varies significantly across models, and in many cases, higher environmental cost yields little performance benefit.
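The prequential evaluation mentioned in the TL;DR is commonly understood as a test-then-train loop: each incoming instance is first used to score the model and only afterwards to update it. The sketch below illustrates that general idea only; it is not the paper's protocol. The `MajorityClassifier` toy model, the `prequential_accuracy` helper, and the four-instance stream are hypothetical names and data introduced for illustration, and the sketch does not measure energy or any other sustainability metric.

```python
from collections import Counter


class MajorityClassifier:
    """Toy incremental model (hypothetical): predicts the most frequent label seen so far."""

    def __init__(self):
        self.counts = Counter()

    def predict(self, x):
        # Before any label has been seen there is nothing to predict.
        if not self.counts:
            return None
        return self.counts.most_common(1)[0][0]

    def learn(self, x, y):
        # Incremental update with a single labelled instance.
        self.counts[y] += 1


def prequential_accuracy(model, stream):
    """Test-then-train: score each instance before the model learns from it."""
    correct = 0
    total = 0
    for x, y in stream:
        if model.predict(x) == y:  # 1) test on the unseen instance
            correct += 1
        model.learn(x, y)          # 2) then train on it
        total += 1
    return correct / total if total else 0.0


# Illustrative stream of (features, label) pairs; the data is made up.
stream = [(0, "a"), (1, "a"), (2, "b"), (3, "a")]
acc = prequential_accuracy(MajorityClassifier(), stream)
print(acc)
```

Because every instance is scored before it is learned, no separate held-out test set is needed, which is what makes this style of evaluation applicable to sequentially arriving data.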

Paper Structure

This paper contains 7 sections, 4 figures, 2 tables, 2 algorithms.

Figures (4)

  • Figure 1: Sustainability vs. Performance trade-off (left, higher is better) and impact of the number of instances on model performance (centre, higher is better) and on model sustainability (right, lower is better) for streaming and batch ML models on the QMNIST dataset.
  • Figure 2: Sustainability vs. Performance trade-off (left, higher is better) and impact of the number of instances on model performance (centre, higher is better) and on model sustainability (right, lower is better) for streaming and batch ML models on the ML-1M dataset.
  • Figure 3: Sustainability vs. Performance trade-off (left, higher is better) and impact of the number of instances on model performance (centre, higher is better) and on model sustainability (right, lower is better) for streaming and batch ML models on the Waveform 40 dataset.
  • Figure 4: Sustainability vs. Performance trade-off (left, higher is better) and impact of the number of instances on model performance (centre, higher is better) and on model sustainability (right, lower is better) for streaming and batch ML models on the KDDCUP dataset.