Execution-Based Evaluation of Natural Language to Bash and PowerShell for Incident Remediation
Ngoc Phuoc An Vo, Brent Paulovicks, Vadim Sheinin
TL;DR
This work addresses the problem of reliably evaluating Bash and PowerShell code generated from natural language for automated incident remediation in Application Performance Monitoring (APM) settings, and introduces an execution-based evaluation platform for that purpose. It constructs 125 handcrafted test cases across single-line Bash commands, multi-line Bash scripts, and PowerShell, and benchmarks seven LLMs under zero-shot and few-shot settings. The platform uses a containerized, end-to-end pipeline to execute generated code, compare actual outcomes with expected results, and support detailed error analysis, so that functional correctness is measured rather than surface-form similarity. The findings show that execution-based evaluation gives a more realistic assessment of code functionality: GPT-4o performs best overall, and Granite-family models are competitive, especially on Bash tasks. The work also identifies multi-line Bash scripts and math-related prompts as remaining challenges, which can guide future improvements. The methodology and dataset support robust evaluation of NL-to-shell code in practical incident remediation and can extend to broader Bash and PowerShell testing beyond the APM context.
Abstract
Given recent advancements in Large Language Models (LLMs), code generation tasks have attracted immense attention because of their wide application across domains. To evaluate and select the best model for automatically remediating system incidents discovered by Application Performance Monitoring (APM) platforms, it is crucial to verify that the generated code is syntactically and semantically correct and that it executes as intended. However, current methods for evaluating the quality of LLM-generated code rely heavily on surface-form similarity metrics (e.g., BLEU, ROUGE, and exact/partial match), which have numerous limitations. In contrast, execution-based evaluation focuses on code functionality and does not constrain the generated code to any fixed solution. Nevertheless, designing and implementing such an execution-based evaluation platform is not a trivial task. Several works have created execution-based evaluation platforms for popular programming languages such as SQL, Python, and Java, but there have been few or no attempts for scripting languages such as Bash and PowerShell. In this paper, we present the first execution-based evaluation platform for these languages, for which we created three test suites (125 handcrafted test cases in total) to evaluate Bash (both single-line commands and multi-line scripts) and PowerShell code generated by LLMs. We benchmark seven closed and open-source LLMs on our platform under different prompting techniques (zero-shot vs. few-shot learning).
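The contrast between surface-form matching and execution-based evaluation can be illustrated with a minimal Python sketch. It is not the platform described in this paper: the helper names (`run_command`, `exact_match`, `execution_match`) and the sample test case are hypothetical, and the command runs directly on the local machine rather than in a container, which would be unsafe for untrusted model output.

```python
from __future__ import annotations

import shutil
import subprocess

# Assumption: use bash when available, otherwise fall back to the POSIX shell.
SHELL = shutil.which("bash") or "sh"


def run_command(command: str, timeout: float = 5.0) -> tuple[int, str]:
    """Run a shell command and return (exit code, stdout without surrounding whitespace)."""
    result = subprocess.run(
        [SHELL, "-c", command],
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return result.returncode, result.stdout.strip()


def exact_match(generated: str, reference: str) -> bool:
    """Surface-form check: the generated text must equal the reference text."""
    return generated.strip() == reference.strip()


def execution_match(generated: str, expected_output: str) -> bool:
    """Execution-based check: the command must succeed and print the expected output."""
    exit_code, stdout = run_command(generated)
    return exit_code == 0 and stdout == expected_output.strip()


# Hypothetical test case: two different commands that compute the same sum.
reference = "echo $((2 + 3))"
generated = "x=2; y=3; echo $((x + y))"
expected_output = "5"

if __name__ == "__main__":
    print("exact match:    ", exact_match(generated, reference))
    print("execution match:", execution_match(generated, expected_output))
```

In this example the generated command differs textually from the reference, so an exact-match metric rejects it, while the execution-based check accepts it because it produces the expected result. A realistic harness would also need isolation, environment setup and teardown, and checks on side effects such as created files, none of which this sketch attempts.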
