Large Language Model-Based Benchmarking Experiment Settings for Evolutionary Multi-Objective Optimization

Lie Meng Pang, Hisao Ishibuchi

TL;DR

This work investigates what implicit knowledge LLMs bring to EMO algorithm benchmarking by prompting two models to design benchmarking experiments. It reveals a convergence on classical settings: NSGA-II, MOEA/D, and NSGA-III tested on ZDT, DTLZ, and WFG with HV and IGD as standard indicators, alongside common parameter choices such as reference-point specifications and population sizes. The analysis highlights gaps between LLM guidance and principled practice, including the use of a fixed HV reference point $r=(1.1,\ldots,1.1)$ and large IGD reference sets, and notes model-specific differences in preferred algorithms and test problems. Overall, the study demonstrates that LLMs tend to reproduce established benchmarking norms, underscoring potential biases in automated EMO algorithm design and the need for more robust, theory-grounded guidance in LLM-assisted workflows.

Abstract

When we manually design an evolutionary optimization algorithm, we implicitly or explicitly assume a set of target optimization problems. In the case of automated algorithm design, the target optimization problems are usually stated explicitly. Recently, the use of large language models (LLMs) for the design of evolutionary multi-objective optimization (EMO) algorithms has been examined in some studies. In those studies, the target multi-objective problems are not always stated explicitly. It is well known in the EMO community that the performance evaluation results of EMO algorithms depend not only on test problems but also on many other factors such as performance indicators, reference point, termination condition, and population size. Thus, it is likely that the EMO algorithms designed by LLMs depend on those factors. In this paper, we examine the implicit assumptions about the performance comparison of EMO algorithms in LLMs. For this purpose, we ask LLMs to design a benchmarking experiment for EMO algorithms. Our experiments show that LLMs often suggest classical benchmark settings: performance examination of NSGA-II, MOEA/D, and NSGA-III on ZDT, DTLZ, and WFG by HV and IGD under the standard parameter specifications.
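
The dependence of comparison results on the reference point can be illustrated with a minimal sketch. The functions `hv_2d` and `igd` below are hypothetical helpers written for this illustration (they are not taken from the paper), the two solution sets are toy data rather than results from the paper, and the sweep-based HV computation assumes a two-objective minimization problem.

```python
import math


def hv_2d(points, ref):
    """Hypervolume of a two-objective minimization set w.r.t. reference point `ref`."""
    # Only points that strictly dominate the reference point contribute.
    inside = sorted(p for p in points if p[0] < ref[0] and p[1] < ref[1])
    hv, prev_f2 = 0.0, ref[1]
    for f1, f2 in inside:  # sweep in increasing f1
        if f2 < prev_f2:  # points that do not improve f2 are dominated; skip them
            hv += (ref[0] - f1) * (prev_f2 - f2)
            prev_f2 = f2
    return hv


def igd(points, reference_set):
    """Mean distance from each reference point to its nearest solution."""
    total = sum(min(math.dist(r, p) for p in points) for r in reference_set)
    return total / len(reference_set)


# Toy data (assumed for illustration): two extreme solutions vs. one central solution.
set_a = [(0.0, 1.0), (1.0, 0.0)]
set_b = [(0.5, 0.5)]

# With a reference point close to the front, the central solution scores higher.
print(hv_2d(set_a, (1.1, 1.1)) < hv_2d(set_b, (1.1, 1.1)))  # True
# With a distant reference point, the extreme solutions score higher.
print(hv_2d(set_a, (2.0, 2.0)) > hv_2d(set_b, (2.0, 2.0)))  # True

# IGD against a small assumed reference set on the line f1 + f2 = 1.
reference_set = [(0.0, 1.0), (0.5, 0.5), (1.0, 0.0)]
print(igd(set_a, reference_set), igd(set_b, reference_set))
```

In this toy case the HV ranking of the two sets flips when only the reference point changes, which is one reason the reference-point specification matters when interpreting benchmarking results.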

Paper Structure

This paper contains 17 sections, 2 equations, 3 figures, 6 tables.

Figures (3)

  • Figure 1: Pareto fronts of the DTLZ2, Minus-DTLZ2, RWA2, and RWA6 problems.
  • Figure 2: (Near) Optimal HV distribution for three- and five-objective DTLZ1 and Minus-DTLZ1 with different specifications of reference point. For the five-objective problems, the solutions are projected into the $f_1-f_2-f_3$ space for better visualization.
  • Figure 3: Two Pareto-optimal solution sets with different solution distributions for the normalized ten-objective Minus-DTLZ1 problem, each containing 275 solutions.