Energy and Carbon Considerations of Fine-Tuning BERT

Xiaorong Wang; Clara Na; Emma Strubell; Sorelle Friedler; Sasha Luccioni

TL;DR

A careful empirical study of the computational costs of fine-tuning across tasks, datasets, hardware infrastructure and measurement modalities is performed to place fine-tuning energy and carbon costs into perspective with respect to pre-training and inference.

Abstract

Despite the popularity of the "pre-train then fine-tune" paradigm in the NLP community, existing work quantifying energy costs and associated carbon emissions has largely focused on language model pre-training. Although a single pre-training run draws substantially more energy than fine-tuning, fine-tuning is performed more frequently by many more individual actors, and thus must be accounted for when considering the energy and carbon footprint of NLP. In order to better characterize the role of fine-tuning in the landscape of energy and carbon emissions in NLP, we perform a careful empirical study of the computational costs of fine-tuning across tasks, datasets, hardware infrastructure and measurement modalities. Our experimental results allow us to place fine-tuning energy and carbon costs into perspective with respect to pre-training and inference, and outline recommendations to NLP researchers and practitioners who wish to improve their fine-tuning energy efficiency.
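The aggregation argument in the abstract (one expensive pre-training run versus many cheap fine-tuning runs) can be illustrated with a minimal sketch. It relies only on the standard conversion of energy to emissions via grid carbon intensity. The function names and every number below are hypothetical placeholders chosen for round arithmetic; none of them are measurements from the paper.

```python
def emissions_kg(energy_kwh: float, intensity_g_per_kwh: float) -> float:
    """Convert energy (kWh) to emissions (kg CO2e) for a given grid carbon intensity."""
    return energy_kwh * intensity_g_per_kwh / 1000.0


def total_emissions_kg(energy_per_run_kwh: float, runs: int,
                       intensity_g_per_kwh: float) -> float:
    """Emissions of `runs` identical runs, each drawing `energy_per_run_kwh`."""
    return emissions_kg(energy_per_run_kwh * runs, intensity_g_per_kwh)


# ASSUMPTION: all values below are illustrative, not results from the paper.
GRID_INTENSITY = 400.0        # g CO2e per kWh (hypothetical grid)
PRETRAIN_KWH = 1000.0         # energy of one pre-training run (hypothetical)
FINETUNE_KWH = 1.0            # energy of one fine-tuning run (hypothetical)
FINETUNE_RUNS = 5000          # fine-tuning runs across many actors (hypothetical)

pretrain_total = total_emissions_kg(PRETRAIN_KWH, 1, GRID_INTENSITY)
finetune_total = total_emissions_kg(FINETUNE_KWH, FINETUNE_RUNS, GRID_INTENSITY)

print(f"pre-training: {pretrain_total:.1f} kg CO2e")
print(f"fine-tuning:  {finetune_total:.1f} kg CO2e")
```

Under these made-up inputs the aggregate fine-tuning footprint exceeds the single pre-training run even though each fine-tuning run is a thousand times cheaper, which is the reason the abstract gives for accounting for fine-tuning at all.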

Paper Structure (23 sections, 3 equations, 11 figures, 10 tables)

Introduction
Related Work
Methodology
Pre-training BERT
Fine-tuning BERT
Hardware Platforms
Measuring Energy and Carbon
Results and Discussion
Pre-training and Distillation
Comparing and Predicting Fine-tuning Emissions
Fine-tuning vs. Pre-training
Fine-tuning vs. Inference
Single vs. Multiple GPUs
Conclusion
Appendix
...and 8 more sections

Figures (11)

Figure 1: Total energy consumed (kWh) is strongly related to number of tokens seen for BERT models on both A100 and RTX 8000 GPU machines, although the relationship is more predictive on the RTX 8000 machine and energy usage is less consistent with DistilBERT. Note that both axes are in log scale. An alternative view of similar data in Figure \ref{fig:tokens-herron} distinguishes pre-training workloads' energy consumption slightly from that of fine-tuning tasks.
Figure 2: Kill-A-Watt wall reading measurement setup
Figure 3: Energy consumed (kWh) against total tokens seen on A100 machine
Figure 4: Energy consumed (kWh) against total tokens seen on RTX 8000 machine
Figure 5: Energy consumed (kWh) against training time (s) on A100 machine, showing the first 30 minutes
...and 6 more figures
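
The log-log relationship between tokens seen and energy described in the Figure 1 caption can be explored with an ordinary least-squares fit in log space. The sketch below is one plausible way to do that and is not the authors' analysis: the helper `fit_power_law` is a hypothetical name, and the data points are synthetic, generated from an assumed constant energy-per-token rate purely so the fit has something to recover.

```python
import math


def fit_power_law(tokens, energy_kwh):
    """Least-squares fit of log10(energy) = slope * log10(tokens) + intercept."""
    xs = [math.log10(t) for t in tokens]
    ys = [math.log10(e) for e in energy_kwh]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# ASSUMPTION: synthetic data, not measurements from the paper.
KWH_PER_TOKEN = 2e-9                       # hypothetical constant rate
tokens_seen = [1e6, 1e7, 1e8, 1e9]
energy = [KWH_PER_TOKEN * t for t in tokens_seen]

slope, intercept = fit_power_law(tokens_seen, energy)
print(f"slope={slope:.3f}, intercept={intercept:.3f}")
```

A slope near 1 in log space corresponds to energy growing roughly in proportion to tokens seen; real measurements would scatter around the fitted line, and the caption notes that the fit quality differs by machine and model.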
