Towards Sustainable Large Language Model Serving

Sophia Nguyen; Beihao Zhou; Yi Ding; Sihang Liu

TL;DR

This work characterizes the performance and energy of LLaMA models with 1B, 3B, and 7B parameters on two Nvidia GPU types, a latest-generation RTX6000 Ada and an older-generation T4. It analytically models operational carbon emissions from energy consumption and grid carbon intensities, and embodied carbon emissions from chip area and memory size.

Abstract

In this work, we study LLMs from a carbon emission perspective, addressing both operational and embodied emissions, and paving the way for sustainable LLM serving. We characterize the performance and energy of LLaMA models with 1B, 3B, and 7B parameters on two Nvidia GPU types, a latest-generation RTX6000 Ada and an older-generation T4. We analytically model operational carbon emissions based on energy consumption and on carbon intensities from three grid regions, each representing a different energy source mix. We model embodied carbon emissions based on chip area and memory size. Together, the characterization and modeling give an in-depth view of the performance, energy, and carbon emissions of LLM serving. Our findings highlight the potential for optimizing sustainable LLM serving systems by considering operational and embodied carbon emissions simultaneously.
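As a rough illustration of the two kinds of emissions described above, the sketch below computes operational carbon as energy times grid carbon intensity, and attributes a share of embodied carbon to a workload in proportion to its runtime over the GPU's lifetime. This is a minimal sketch under stated assumptions, not the paper's equations: the function names, the linear lifetime amortization, and every input value are hypothetical, and the paper's embodied model (based on chip area and memory size) is not reproduced here.

```python
JOULES_PER_KWH = 3.6e6  # 1 kWh = 3.6 million joules


def operational_carbon_g(energy_joules: float,
                         carbon_intensity_g_per_kwh: float) -> float:
    """Operational emissions (gCO2eq) = energy in kWh * grid carbon intensity."""
    energy_kwh = energy_joules / JOULES_PER_KWH
    return energy_kwh * carbon_intensity_g_per_kwh


def amortized_embodied_carbon_g(total_embodied_g: float,
                                runtime_s: float,
                                lifetime_years: float) -> float:
    """Share of a device's embodied emissions attributed to one workload.

    Assumption (hypothetical): embodied carbon is spread linearly over the
    device lifetime, so a workload is charged runtime / lifetime of the total.
    """
    lifetime_s = lifetime_years * 365 * 24 * 3600
    return total_embodied_g * runtime_s / lifetime_s


if __name__ == "__main__":
    # All inputs are placeholders for illustration, not measurements.
    energy_j = 500.0          # energy used to serve one prompt
    intensity = 250.0         # grid carbon intensity in gCO2eq/kWh
    embodied_total = 20000.0  # total embodied carbon of the GPU in gCO2eq
    runtime = 2.0             # seconds of GPU time for the prompt
    lifetime = 5.0            # assumed GPU lifetime in years

    op = operational_carbon_g(energy_j, intensity)
    emb = amortized_embodied_carbon_g(embodied_total, runtime, lifetime)
    print(f"operational: {op:.6f} g, embodied share: {emb:.6f} g")
```

Under this kind of amortization, a longer assumed lifetime lowers the embodied share charged to each prompt, while the operational term depends only on the energy used and the grid that supplies it.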

Paper Structure (12 sections, 4 equations, 7 figures, 2 tables)

Introduction
LLM Characterization
Methodology
Latency vs. Energy Consumption
Prefill and Decode Phases
Carbon Emission Analysis
Methodology
Carbon Emissions in Different Regions
Carbon Emissions in Prefill/Decode Phases
Impact of Extending GPU Lifetime
Future Directions
Conclusions

Figures (7)

Figure 1: Latency and energy consumption of RTX6000 Ada and T4 under different parameter sizes and batch sizes. "OOM"=out of memory.
Figure 2: Throughput and energy in the prefill phase (1B-parameter LLaMA).
Figure 3: Throughput and energy in the decode phase (1B-parameter LLaMA).
Figure 4: Per-prompt carbon emission under the QC, CISO, and PACE grids (1B-parameter LLaMA).
Figure 5: Per-token carbon emission in the prefill phase under the CISO grid (1B parameters).
...and 2 more figures
