An Empirical Study of Sustainability in Prompt-driven Test Script Generation Using Small Language Models

Pragati Kumari, Novarun Deb

Abstract

The increasing use of language models in automated test script generation raises concerns about their environmental impact, yet existing sustainability analyses focus predominantly on large language models. As a result, the energy and carbon characteristics of small language models (SLMs) during prompt-driven unit-test script generation remain largely unexplored. To address this gap, this study empirically examines the environmental and performance tradeoffs of SLMs (in the 2B–8B parameter range) using the HumanEval benchmark and adaptive prompt variants (based on the Anthropic template). The analysis uses CodeCarbon to characterize energy consumption, carbon emissions, and duration under controlled conditions, with unit-test script coverage serving as an initial proxy for generated test quality. Our results show that different SLMs exhibit distinct sustainability profiles: some favor lower energy use and faster execution, while others maintain higher stability or coverage under comparable conditions. Overall, this work provides focused empirical evidence on sustainable SLM-based test script generation, clarifying how prompt structure and model selection jointly shape environmental and performance outcomes.

Paper Structure

This paper contains 19 sections, 3 equations, 4 figures, 3 tables.

Figures (4)

  • Figure 1: The Empirical Study pipeline
  • Figure 2: SCI evaluations for different model-prompt configurations w.r.t. the geographic region where the inference was performed on Google Colab.
  • Figure 3: SVI scores for different model-prompt configurations w.r.t. the geographic region where the inference was performed on Google Colab.
  • Figure 4: SVI scores for different quantizations of Phi-3.5-mini and Qwen2.5-1.5B w.r.t. the geographic region where the inference was performed on Google Colab.