AI-CARE: Carbon-Aware Reporting Evaluation Metric for AI Models

KC Santosh; Srikanth Baride; Rodrigue Rizk

TL;DR

AI-CARE addresses a gap in ML benchmarking, where energy and carbon costs usually go unreported, by introducing a standardized carbon-aware evaluation tool that reports energy consumption and carbon emissions alongside task performance. It defines the carbon-performance tradeoff curve and a scalar score to enable multi-objective comparisons, formalized through the Pareto frontier $\mathcal{T}$ and the scalar score $\mathrm{SCAS}$, which balance the performance $P(m)$ and the carbon cost $C(m)$ of a model $m$. The approach is architecture- and workflow-agnostic and is validated across multiple vision benchmarks, showing that carbon-aware benchmarking can reorder model rankings and incentivize environmentally responsible designs. The work provides open-source tooling to integrate energy and carbon accounting into existing evaluation pipelines, aligning ML progress with sustainability goals.

Abstract

As machine learning (ML) continues its rapid expansion, the environmental cost of model training and inference has become a critical societal concern. Existing benchmarks overwhelmingly focus on standard performance metrics such as accuracy, BLEU, or mAP, while largely ignoring energy consumption and carbon emissions. This single-objective evaluation paradigm is increasingly misaligned with the practical requirements of large-scale deployment, particularly in energy-constrained settings such as mobile devices, developing regions, and climate-aware enterprises. In this paper, we propose AI-CARE, an evaluation tool for reporting the energy consumption and carbon emissions of ML models. In addition, we introduce the carbon-performance tradeoff curve, an interpretable tool that visualizes the Pareto frontier between performance and carbon cost. We demonstrate, through theoretical analysis and empirical validation on representative ML workloads, that carbon-aware benchmarking changes the relative ranking of models and encourages architectures that are simultaneously accurate and environmentally responsible. Our proposal aims to shift the research community toward transparent, multi-objective evaluation and align ML progress with global sustainability goals. The tool and documentation are available at https://github.com/USD-AI-ResearchLab/ai-care.
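
The Pareto frontier mentioned above can be illustrated with a minimal sketch. It assumes each model is summarized by a performance value $P(m)$ (higher is better) and a carbon cost $C(m)$ (lower is better). The `ModelResult` type, the `pareto_frontier` function, and the sample numbers are hypothetical, invented for illustration. They are not taken from the AI-CARE tool or from the paper's results.

```python
from typing import List, NamedTuple


class ModelResult(NamedTuple):
    """Hypothetical record for one evaluated model."""
    name: str
    performance: float  # P(m): higher is better (e.g. accuracy)
    carbon: float       # C(m): lower is better (e.g. kg CO2e)


def pareto_frontier(results: List[ModelResult]) -> List[ModelResult]:
    """Return models that no other model dominates, sorted by carbon cost.

    Model o dominates model m if o is at least as good on both axes
    and strictly better on at least one.
    """
    frontier = []
    for m in results:
        dominated = any(
            o.performance >= m.performance
            and o.carbon <= m.carbon
            and (o.performance > m.performance or o.carbon < m.carbon)
            for o in results
        )
        if not dominated:
            frontier.append(m)
    return sorted(frontier, key=lambda r: r.carbon)


# Illustrative values only, not measurements from the paper.
results = [
    ModelResult("small_cnn", 0.90, 0.5),
    ModelResult("mid_cnn", 0.93, 2.0),
    ModelResult("wasteful", 0.92, 5.0),   # worse than mid_cnn on both axes
    ModelResult("large_vit", 0.95, 10.0),
]
frontier_names = [r.name for r in pareto_frontier(results)]
print(frontier_names)
```

In this toy example `wasteful` drops off the frontier because `mid_cnn` is both more accurate and cheaper in carbon. A ranking by performance alone would have placed it above `small_cnn`, which is the kind of reordering the abstract describes.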

Paper Structure (7 sections, 6 equations, 3 figures, 1 algorithm)

This paper contains 7 sections, 6 equations, 3 figures, 1 algorithm.

Introduction
Energy Is All You Forgot
Standardized Carbon-Aware Evaluation in ML
Design and Implementation
Problem Statement
Experiments and Results
Conclusion

Figures (3)

Figure 1: Sequence-style control-flow diagram of the proposed reporting framework. The main process spawns external monitoring and carbon-emission services and executes models on the evaluation dataset. During execution, energy consumption $E(m)$ is measured periodically, while carbon-emission values are retrieved asynchronously. After execution completes, performance, energy, and carbon metrics are normalized and aggregated to generate final reports. The framework operates purely as a reporting layer and does not influence model execution, training, or optimization.
Figure 2: Global carbon-performance tradeoff across all evaluated model-dataset pairs. Each subplot reports task performance (Accuracy, Precision, Recall, and F1-Score) versus total carbon emissions (training + inference) shown on a logarithmic scale. Colors indicate datasets, while marker shapes denote model families.
Figure 3: Metric-wise scalar carbon-aware scores across datasets and model families. Rows correspond to evaluation metrics (accuracy, precision, recall, and F1-score), while columns correspond to datasets (MNIST, Fashion-MNIST, CIFAR-10, CIFAR-100, and ImageNet-100). Each score integrates normalized task performance with total carbon emissions (training + inference). The grouped layout enables direct cross-metric and cross-dataset comparison of carbon-aware model rankings.
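
The Figure 1 caption describes periodic energy measurement followed by carbon accounting, and that step can be sketched as follows. The sketch assumes the monitor yields timestamped power samples in watts and that a single grid carbon intensity (g CO2e per kWh) applies to the whole run. The function names `energy_kwh` and `carbon_kg` and the sample values are hypothetical, and the actual AI-CARE implementation may measure and aggregate differently.

```python
from typing import Sequence


def energy_kwh(timestamps_s: Sequence[float], power_w: Sequence[float]) -> float:
    """Integrate power samples (W) over time (s) with the trapezoidal rule.

    Returns energy in kilowatt-hours (1 kWh = 3.6e6 J).
    """
    if len(timestamps_s) != len(power_w):
        raise ValueError("timestamps and power samples must have equal length")
    joules = 0.0
    for i in range(1, len(timestamps_s)):
        dt = timestamps_s[i] - timestamps_s[i - 1]
        joules += 0.5 * (power_w[i] + power_w[i - 1]) * dt
    return joules / 3.6e6


def carbon_kg(energy: float, intensity_g_per_kwh: float) -> float:
    """Convert energy (kWh) to emissions (kg CO2e) for a given grid intensity."""
    return energy * intensity_g_per_kwh / 1000.0


# Illustrative run: a constant 200 W draw for one hour, sampled every 30 minutes.
timestamps = [0.0, 1800.0, 3600.0]
power = [200.0, 200.0, 200.0]

e = energy_kwh(timestamps, power)   # 200 W * 1 h = 0.2 kWh
c = carbon_kg(e, 400.0)             # assumed intensity of 400 g CO2e/kWh
print(f"E(m) = {e:.3f} kWh, carbon = {c:.3f} kg CO2e")
```

A time-varying carbon intensity, which the asynchronous retrieval in Figure 1 suggests, could be handled by applying the same conversion per sampling interval and summing the results.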
