Unveiling Environmental Impacts of Large Language Model Serving: A Functional Unit View

Yanran Wu, Inez Hua, Yi Ding

TL;DR

The paper introduces FUEL, a Functional Unit-based Evaluation framework for quantifying the environmental impact of LLM serving. By defining a functional unit (FU) as a token generated under explicit workload, performance, and quality constraints, FUEL enables fair comparisons across model size, quantization, and hardware. Through three case studies using Qwen2.5 and Llama2 on vLLM with NewsQA as the benchmark, the work reveals nuanced trade-offs: larger models can be greener in high-quality, low-QPS regimes; weight-and-activation quantization (W8A8) often yields robust carbon reductions; and older hardware can outperform newer systems in carbon efficiency under certain conditions. These insights provide practical guidance for deploying greener LLM services and underscore the importance of jointly optimizing model configuration and infrastructure. The authors release their code on GitHub for reproducible, FU-based environmental assessments.
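
To make the FU idea concrete, here is a minimal sketch of how carbon per FU might be computed. It is not the FUEL implementation: the `Request` fields, the thresholds, and the choice to count only operational carbon (measured energy times grid carbon intensity) are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Request:
    """One served request (hypothetical record layout, not from the paper)."""
    tokens: int        # output tokens generated for this request
    latency_s: float   # latency in seconds, standing in for the performance constraint
    quality: float     # quality score of the output, standing in for the quality constraint


def carbon_per_fu(requests, energy_kwh, carbon_intensity_g_per_kwh,
                  max_latency_s, min_quality):
    """Return grams of CO2e per qualifying token.

    Assumption: only operational carbon is counted, and a token counts as a
    functional unit only if its request meets both the latency and quality
    thresholds.
    """
    qualifying_tokens = sum(
        r.tokens for r in requests
        if r.latency_s <= max_latency_s and r.quality >= min_quality
    )
    if qualifying_tokens == 0:
        # No output met the constraints, so no functional unit was delivered.
        return float("inf")
    total_carbon_g = energy_kwh * carbon_intensity_g_per_kwh
    return total_carbon_g / qualifying_tokens


# Usage: three requests, of which only the first meets both constraints.
requests = [
    Request(tokens=100, latency_s=1.0, quality=10.0),
    Request(tokens=50, latency_s=5.0, quality=10.0),   # too slow
    Request(tokens=50, latency_s=1.0, quality=-10.0),  # quality too low
]
result = carbon_per_fu(requests, energy_kwh=0.01,
                       carbon_intensity_g_per_kwh=400.0,
                       max_latency_s=2.0, min_quality=0.0)
```

Because tokens that miss the constraints are excluded from the denominator, a configuration that uses less energy but produces many low-quality or slow outputs can still score worse per FU, which is the kind of trade-off the case studies examine.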

Abstract

Large language models (LLMs) offer powerful capabilities but come with significant environmental impact, particularly in carbon emissions. Existing studies benchmark carbon emissions but lack a standardized basis for comparison across different model configurations. To address this, we introduce the concept of functional unit (FU) as a standardized basis and develop FUEL, the first FU-based framework for evaluating LLM serving's environmental impact. Through three case studies, we uncover key insights and trade-offs in reducing carbon emissions by optimizing model size, quantization strategy, and hardware choice, paving the way for more sustainable LLM serving. The code is available at https://github.com/jojacola/FUEL.

Paper Structure

This paper contains 47 sections, 5 equations, 38 figures, 3 tables.

Figures (38)

  • Figure 1: Overview of FUEL framework.
  • Figure 2: Carbon emission per FU for different model sizes across Qscores at QPS=1 req/s.
  • Figure 3: Qscore distribution of outputs across different model sizes on the NewsQA dataset.
  • Figure 4: Carbon savings of Qwen 14B and 32B compared to 7B with Qscore low (-5) and high (15). Data for Qwen 32B are missing at QPS > 4 req/s, as larger models cannot serve intensive workloads.
  • Figure 5: Carbon savings of Llama 13B compared to 7B with Qscore low (-5) and high (10).
  • ...and 33 more figures