LLM4Perf: Large Language Models Are Effective Samplers for Multi-Objective Performance Modeling

Xin Wang, Zhenhao Li, Zishuo Ding

TL;DR

This work explores using large language models to guide sampling for multi-objective performance modeling in configurable software. It presents LLM4Perf, which prunes the configuration space via documentation and refines sampling through a feedback loop, evaluated on four real-world systems with two performance metrics and multiple budgets. Results show LLM4Perf achieves the best predictive accuracy in a majority of scenarios and consistently outperforms baselines as objective dimensionality grows, with space pruning proving broadly beneficial and adaptive refinement crucial for handling complex trade-offs. The study also reveals how different LLM components contribute differently to effectiveness and provides practical guidance on model selection and hyperparameter settings for cost-effective deployment.

Abstract

The performance of modern software systems depends critically on their complex configuration options. Building accurate performance models to navigate this vast space requires effective sampling strategies, yet existing methods often struggle with multi-objective optimization and cannot leverage semantic information from documentation. The recent success of Large Language Models (LLMs) motivates the central question of this work: can LLMs serve as effective samplers for multi-objective performance modeling? To explore this, we present a comprehensive empirical study of the capabilities and characteristics of LLM-driven sampling. We design and implement LLM4Perf, a feedback-based framework, and use it to systematically evaluate the LLM-guided sampling process across four highly configurable, real-world systems. Our study reveals that the LLM-guided approach outperforms traditional baselines in most cases: LLM4Perf achieves the best performance in about 68.8% (77 out of 112) of all evaluation scenarios. We find that this effectiveness stems from the LLM's dual capabilities of configuration space pruning and feedback-driven strategy refinement. The value of pruning is further validated by the fact that it also improves the performance of the baseline methods in about 91.5% (410 out of 448) of cases. Furthermore, we show how the choice of LLM for each component and the hyperparameters within LLM4Perf affect its effectiveness. Overall, this paper provides strong evidence for the effectiveness of LLMs in performance engineering and offers concrete insights into the mechanisms that drive their success.
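The two mechanisms named in the abstract, configuration space pruning and feedback-driven refinement, can be illustrated with a minimal sketch. This is not the authors' implementation: the LLM is replaced by simple stand-ins (a hand-written set of insensitive options, and a "pick the candidate least similar to what was already measured" rule), and the option names, the toy `measure` function, and all helper names are hypothetical.

```python
import random


def prune_space(space, insensitive_options):
    """Drop options judged performance-insensitive.

    Stand-in for the LLM reading documentation; here the set is given by hand.
    """
    return {name: values for name, values in space.items()
            if name not in insensitive_options}


def hamming(a, b):
    """Number of options on which two configurations differ."""
    return sum(a[k] != b[k] for k in a)


def measure(config):
    """Hypothetical two-objective benchmark: returns (time, size)."""
    time = config["level"] * 2.0 / config["threads"]
    size = 100.0 / config["level"]
    return (time, size)


def feedback_sampling(space, measure_fn, budget, n_candidates,
                      seed=0, max_iters=1000):
    """Iteratively propose candidates and measure one new config per round.

    The selection rule (maximize the minimum Hamming distance to already
    measured configs) is an assumed stand-in for LLM-driven refinement.
    """
    rng = random.Random(seed)
    names = sorted(space)
    samples = []  # list of (config, objectives) pairs
    for _ in range(max_iters):
        if len(samples) >= budget:
            break
        candidates = [{n: rng.choice(space[n]) for n in names}
                      for _ in range(n_candidates)]
        seen = [c for c, _ in samples]
        fresh = [c for c in candidates if c not in seen]
        if not fresh:
            continue  # every proposal was already measured; try again
        if seen:
            best = max(fresh, key=lambda c: min(hamming(c, s) for s in seen))
        else:
            best = fresh[0]
        samples.append((best, measure_fn(best)))
    return samples


# Hypothetical option names, loosely styled after a compression tool.
full_space = {
    "level": list(range(1, 10)),
    "threads": [1, 2, 4, 8],
    "verbose": [False, True],  # assumed to have no performance impact
}
pruned = prune_space(full_space, {"verbose"})
samples = feedback_sampling(pruned, measure, budget=5, n_candidates=4)
for config, objectives in samples:
    print(config, objectives)
```

Each round measures one new configuration and feeds the history back into the next selection, which is the general shape of a feedback loop; the resulting `(config, objectives)` pairs would then serve as training data for a performance model.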

Paper Structure

This paper contains 19 sections, 4 figures, 5 tables.

Figures (4)

  • Figure 1: An overview of our study.
  • Figure 2: Scalability comparison (RMSE) of LLM4Perf (blue line) against multi-objective baselines (NSGA-III, EHVI, TSEMO) on LRZIP, varying the number of target objectives (2, 3, and 4). Lower RMSE is better. "N/A" marks metrics not included.
  • Figure 3: (RQ4) Prediction error (RMSE) of the XGBoost model across different sample sizes and numbers of candidate configurations per iteration ($N_{\text{candidates}}$). Lighter shading indicates lower error, reflecting better predictive performance.
  • Figure 4: (RQ4) Influence of the number of parallel Configuration Generators ($N_{\text{generators}}$) on the final RMSE. Box plots summarize 10 independent runs for XGBoost.
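
For reference, the RMSE reported in the figure captions above has, under its standard definition, the following form. The symbols are introduced here for illustration and are not taken from the paper: $n$ is the number of test configurations, $y_i$ the measured performance of configuration $i$, and $\hat{y}_i$ the value predicted by the performance model.

$$
\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2}
$$

Lower values mean that predictions lie closer to the measurements, which is why lower RMSE is better in Figures 2 to 4.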