Holistically Evaluating the Environmental Impact of Creating Language Models

Jacob Morrison, Clara Na, Jared Fernandez, Tim Dettmers, Emma Strubell, Jesse Dodge

TL;DR

This paper conducts a holistic environmental accounting for language-model development, training, and deployment by tracing emissions and water use across hardware manufacturing, model development, and final training for the OLMo series (20M–13B parameters, up to 5.6T tokens). It combines operational metrics with embodied emissions, using $CO_2e = P \cdot PUE \cdot CI$ and water-consumption models, and amortizes manufacturing impacts over hardware lifetimes. Key findings show total emissions around 493 tCO2eq and water use near 2,769 kL, with development contributing roughly half of training's footprint and power usage fluctuating during training due to checkpointing. The work underscores the need for greater transparency, policy standards, and engineering strategies to mitigate environmental impact as AI scales, including attention to embodied emissions, water use, and grid-operations implications of power spikes and drops.
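
The operational part of this accounting can be sketched in a few lines. The following is a minimal sketch, assuming $P$ is the energy consumed by the hardware in kWh, $PUE$ is the data center's power usage effectiveness, and $CI$ is the grid's carbon intensity in kg CO2e per kWh. The function name and all numeric inputs are hypothetical placeholders, not values from the paper.

```python
def operational_co2e_kg(
    energy_kwh: float,
    pue: float,
    carbon_intensity_kg_per_kwh: float,
) -> float:
    """Operational emissions in kg CO2e, following CO2e = P * PUE * CI."""
    if energy_kwh < 0 or carbon_intensity_kg_per_kwh < 0:
        raise ValueError("energy and carbon intensity must be non-negative")
    if pue < 1.0:
        # PUE is total facility energy divided by IT energy, so it cannot be below 1.
        raise ValueError("PUE must be at least 1.0")
    return energy_kwh * pue * carbon_intensity_kg_per_kwh


# Hypothetical inputs chosen only to illustrate the formula.
example_kg = operational_co2e_kg(
    energy_kwh=1000.0,
    pue=1.2,
    carbon_intensity_kg_per_kwh=0.4,
)
print(f"{example_kg:.1f} kg CO2e")
```

This covers only operational emissions; the embodied (manufacturing) emissions and water consumption described above would need separate terms.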

Abstract

As the performance of artificial intelligence systems has dramatically increased, so too has the environmental impact of creating these systems. While many model developers release estimates of the power consumption and carbon emissions from the final training runs for their latest models, there is comparatively little transparency into the impact of model development, hardware manufacturing, and total water usage throughout. In this work, we estimate the real-world environmental impact of developing a series of language models, ranging from 20 million to 13 billion active parameters, trained on up to 5.6 trillion tokens each. When accounting for hardware manufacturing, model development, and our final training runs, we find that our series of models released 493 metric tons of carbon emissions, equivalent to powering about 98 homes in the United States for one year, and consumed 2.769 million liters of water, equivalent to about 24.5 years of water usage by a person in the United States, even though our data center is extremely water-efficient. We measure and report the environmental impact of our model development; to the best of our knowledge we are the first to do so for LLMs, and we find that model development, the impact of which is generally not disclosed by most model developers, amounted to ~50% of that of training. By looking at detailed time series data for power consumption, we also find that power usage throughout training is not consistent, fluctuating between ~15% and ~85% of our hardware's maximum power draw, with negative implications for grid-scale planning as demand continues to grow. We close with a discussion on the continued difficulty of estimating the environmental impact of AI systems, and key takeaways for model developers and the public at large.

Paper Structure

This paper contains 27 sections, 3 equations, 2 figures, 4 tables.

Figures (2)

  • Figure 1: Environmental impact for a selection of the final training runs described in Section \ref{sec:building-models}, where we rank each model by both its total water consumption and its CO2 emissions. Our small models (<1B parameters) were trained on 1.7 trillion tokens, OLMo 1B was trained on 3 trillion, OLMo 2 7B was trained on 4 trillion, OLMoE was trained on 5 trillion, and OLMo 2 13B was trained on 5.6 trillion. We see that the total environmental impact for larger training runs is quite high, and increases quickly with model and dataset size.
  • Figure 2: Average GPU power for a single node for the first 300 logging steps during OLMo 2 7B training. The first spike is the beginning of training, and each drop happens when a model checkpoint is saved. When actively training, the average GPU power is over 600W, which is over 85% of an H100's maximum power draw of 700W. During checkpointing, power usage drops to just over 100W, or about 15% of the maximum.
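
The percentages quoted in the Figure 2 caption follow from dividing the observed draw by the H100's 700W maximum. Below is a minimal sketch of that arithmetic, using the caption's rounded lower bounds (600W while training, 100W while checkpointing) as inputs; the helper name is hypothetical.

```python
H100_MAX_POWER_W = 700.0  # maximum power draw stated in the caption


def fraction_of_max(power_w: float, max_power_w: float = H100_MAX_POWER_W) -> float:
    """Return the observed power draw as a fraction of the maximum draw."""
    if max_power_w <= 0:
        raise ValueError("maximum power must be positive")
    return power_w / max_power_w


# Rounded lower bounds from the caption, not exact measurements.
training_fraction = fraction_of_max(600.0)
checkpoint_fraction = fraction_of_max(100.0)

print(f"training: {training_fraction:.1%}")         # training: 85.7%
print(f"checkpointing: {checkpoint_fraction:.1%}")  # checkpointing: 14.3%
```

Because the caption reports draws of "over 600W" and "just over 100W", the true fractions sit slightly above these values, which matches the stated ~85% and ~15%.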