Identifying Performance-Sensitive Configurations in Software Systems through Code Analysis with LLM Agents
Zehao Wang, Dong Jae Kim, Tse-Hsun Chen
TL;DR
PerfSense identifies performance-sensitive configurations in large software systems using an LLM-based two-agent setup (DevAgent and PerfAgent) that iteratively analyzes configuration-related code with prompt chaining and retrieval-augmented generation (RAG). The approach is zero-shot and unsupervised, designed to minimize manual effort while handling large codebases via call-graph analysis and document retrieval. An empirical evaluation on seven open-source Java systems shows that PerfSense achieves an average accuracy of 64.77%, outperforming the state-of-the-art DiagConfig (61.75%) and a ChatGPT baseline (50.36%), with prompt chaining improving recall by 10% to 30% at similar precision. The paper also includes a detailed misclassification analysis and discusses practical considerations for adopting LLM-based code analysis in software performance engineering.
Abstract
Configuration settings are essential for tailoring software behavior to meet specific performance requirements. However, incorrect configurations are widespread, and identifying those that impact system performance is challenging due to the vast number and complexity of possible settings. In this work, we present PerfSense, a lightweight framework that leverages Large Language Models (LLMs) to efficiently identify performance-sensitive configurations with minimal overhead. PerfSense employs LLM agents to simulate interactions between developers and performance engineers using advanced prompting techniques such as prompt chaining and retrieval-augmented generation (RAG). Our evaluation of seven open-source Java systems demonstrates that PerfSense achieves an average accuracy of 64.77% in classifying performance-sensitive configurations, outperforming both our LLM baseline (50.36%) and the previous state-of-the-art method (61.75%). Notably, our prompt chaining technique improves recall by 10% to 30% while maintaining similar precision levels. Additionally, a manual analysis of 362 misclassifications reveals common issues, including LLMs' misunderstandings of requirements (26.8%). In summary, PerfSense significantly reduces manual effort in classifying performance-sensitive configurations and offers valuable insights for future LLM-based code analysis research.
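To make the idea of prompt chaining between a developer-role agent and a performance-role agent concrete, the following is a minimal sketch, not the PerfSense implementation. The function name `classify_config`, the prompt wording, the `ENOUGH`/`MORE` and `SENSITIVE`/`INSENSITIVE` answer conventions, and the `llm` callable are all hypothetical; the abstract does not specify the actual prompts or agent protocol, and the retrieval step is reduced here to a list of code snippets supplied by the caller.

```python
from typing import Callable, List

# Hypothetical interface: any function that maps a prompt string to a reply string.
LLM = Callable[[str], str]


def classify_config(name: str, code_snippets: List[str], llm: LLM,
                    max_rounds: int = 2) -> bool:
    """Return True if the configuration is judged performance-sensitive.

    Illustrative only: the roles and prompts are assumptions, not the paper's.
    """
    context = "\n".join(code_snippets)

    # Link 1 (developer role): summarize how the configuration is used in the code.
    summary = llm(f"[developer] Summarize how configuration '{name}' is used:\n{context}")

    # Link 2 (performance role): ask whether the summary suffices; if not,
    # request a more detailed summary and feed it into the next round.
    for _ in range(max_rounds):
        verdict = llm(
            f"[performance] Is this enough to judge '{name}'? "
            f"Answer ENOUGH or MORE.\n{summary}"
        )
        if not verdict.strip().upper().startswith("MORE"):
            break
        summary = llm(f"[developer] Add more detail on '{name}':\n{summary}\n{context}")

    # Link 3 (performance role): final classification, chained on the refined summary.
    answer = llm(f"[performance] Classify '{name}' as SENSITIVE or INSENSITIVE.\n{summary}")
    return answer.strip().upper().startswith("SENSITIVE")
```

Each call consumes the output of the previous one, which is the essence of prompt chaining; a real system would replace `llm` with a model client and `code_snippets` with code retrieved through call-graph analysis.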
