Energy and Carbon Considerations of Fine-Tuning BERT

Xiaorong Wang, Clara Na, Emma Strubell, Sorelle Friedler, Sasha Luccioni

TL;DR

This careful empirical study measures the computational costs of fine-tuning across tasks, datasets, hardware infrastructure and measurement modalities, placing fine-tuning energy and carbon costs into perspective with respect to pre-training and inference.

Abstract

Despite the popularity of the "pre-train then fine-tune" paradigm in the NLP community, existing work quantifying energy costs and associated carbon emissions has largely focused on language model pre-training. Although a single pre-training run draws substantially more energy than fine-tuning, fine-tuning is performed more frequently by many more individual actors, and thus must be accounted for when considering the energy and carbon footprint of NLP. In order to better characterize the role of fine-tuning in the landscape of energy and carbon emissions in NLP, we perform a careful empirical study of the computational costs of fine-tuning across tasks, datasets, hardware infrastructure and measurement modalities. Our experimental results allow us to place fine-tuning energy and carbon costs into perspective with respect to pre-training and inference, and outline recommendations to NLP researchers and practitioners who wish to improve their fine-tuning energy efficiency.

Paper Structure (23 sections, 3 equations, 11 figures, 10 tables)


Figures (11)

  • Figure 1: Total energy consumed (kWh) is strongly related to the number of tokens seen for BERT models on both the A100 and RTX 8000 GPU machines, although the relationship is more predictive on the RTX 8000 machine and energy usage is less consistent with DistilBERT. Note that both axes are in log scale. An alternative view of similar data in Figure \ref{fig:tokens-herron} slightly distinguishes the energy consumption of pre-training workloads from that of fine-tuning tasks.
  • Figure 2: Kill-A-Watt wall reading measurement setup
  • Figure 3: Energy consumed (kWh) against total tokens seen on the A100 machine
  • Figure 4: Energy consumed (kWh) against total tokens seen on the RTX 8000 machine
  • Figure 5: Energy consumed (kWh) against training time (s) on the A100 machine, showing the first 30 minutes
  • ...and 6 more figures