Exploring the True Potential: Evaluating the Black-box Optimization Capability of Large Language Models

Beichen Huang; Xingyu Wu; Yu Zhou; Jibin Wu; Liang Feng; Ran Cheng; Kay Chen Tan

TL;DR

The paper systematically evaluates large language models (LLMs) as black-box optimizers across discrete and continuous tasks, revealing limited proficiency for pure numerical optimization due to string-based number handling and context-length limits. While LLMs struggle with numerical tasks, they show potential in broader optimization contexts by generating problem-specific heuristics from prompts and handling non-numeric problem descriptions. GPT-4 emerges as the strongest model among those tested, yet its performance remains sensitive to prompts and task structure, underscoring the need for careful prompt design and possible tool augmentation. The findings suggest cautious application of LLMs to numerical optimization and highlight promising directions in using LLMs for heuristic generation, non-numeric optimization, and integrating external numerical tools to overcome current limitations.

Abstract

Large language models (LLMs) have demonstrated exceptional performance not only in natural language processing tasks but also in a great variety of non-linguistic domains. LLMs are also increasingly applied in diverse optimization scenarios. However, whether applying LLMs to black-box optimization problems is genuinely beneficial remains unexplored. This paper aims to offer deep insights into the potential of LLMs in optimization through a comprehensive investigation that covers both discrete and continuous optimization problems, assessing the efficacy and distinctive characteristics that LLMs bring to this field. Our findings reveal both the limitations and advantages of LLMs in optimization. On the one hand, despite the significant power consumed in running the models, LLMs exhibit subpar performance in pure numerical tasks, primarily due to a mismatch between the problem domain and their processing capabilities. On the other hand, although LLMs may not be ideal for traditional numerical optimization, their potential in broader optimization contexts remains promising: LLMs can solve problems in non-numerical domains and can leverage heuristics from the prompt to enhance their performance. To the best of our knowledge, this work presents the first systematic evaluation of LLMs for numerical optimization. Our findings pave the way for a deeper understanding of LLMs' role in optimization and guide future application of LLMs in a wide range of scenarios.
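To make the setting concrete, the following is a minimal sketch of a prompt-driven black-box optimization loop, not the authors' implementation. It assumes that evaluated solutions are serialized as text, that the model replies with a bracketed list of numbers, and that only the five best evaluations are kept in the prompt; none of these details come from the paper. The function `stub_proposer` is a hypothetical stand-in for an LLM call (it simply perturbs the best solution), and all other names are illustrative.

```python
import random


def sphere(x):
    """Sphere function: sum of squares, minimized at the origin."""
    return sum(v * v for v in x)


def format_history(history, digits=3):
    """Serialize (solution, fitness) pairs as text, as a prompt would carry them."""
    lines = []
    for x, fx in history:
        xs = ", ".join(f"{v:.{digits}f}" for v in x)
        lines.append(f"x = [{xs}], f(x) = {fx:.{digits}f}")
    return "\n".join(lines)


def parse_solution(text):
    """Parse a reply of the form '[v1, v2, ...]' back into floats."""
    return [float(tok) for tok in text.strip().strip("[]").split(",")]


def stub_proposer(prompt, best, rng, step=0.1, digits=3):
    """Hypothetical stand-in for an LLM call (assumption): it ignores the
    prompt, perturbs the best known solution, and replies as a string."""
    cand = [v + rng.gauss(0.0, step) for v in best]
    return "[" + ", ".join(f"{v:.{digits}f}" for v in cand) + "]"


def optimize(objective, dim=4, iters=50, seed=0):
    """Run the propose/evaluate loop and return the best (solution, fitness)."""
    rng = random.Random(seed)
    x = [rng.uniform(-1, 1) for _ in range(dim)]
    history = [(x, objective(x))]
    for _ in range(iters):
        # Keep only the five best evaluations so the prompt stays short.
        history.sort(key=lambda pair: pair[1])
        history = history[:5]
        prompt = "Minimize f. Previous evaluations:\n" + format_history(history)
        reply = stub_proposer(prompt, history[0][0], rng)
        cand = parse_solution(reply)
        history.append((cand, objective(cand)))
    return min(history, key=lambda pair: pair[1])


if __name__ == "__main__":
    best_x, best_f = optimize(sphere)
    print(best_x, best_f)
```

In this sketch every number crosses the optimizer boundary as a string, which is the kind of interface where the string-based number handling and context-length limits mentioned in the TL;DR would come into play.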

Paper Structure (14 sections, 9 figures, 2 tables)

Introduction
Related Work
Large Language Models
LLMs for Numerical Black-Box Optimization
Evaluation Settings
Model Settings
Prompt Settings
Problem Settings
Procedure Settings
Investigation and Analysis
Baseline Performance
Basic Properties
Advanced Properties
Conclusion and Discussion

Figures (9)

Figure 1: An illustration of our evaluation on applying popular LLMs for black-box optimization. (a) Overall evaluation process. First, we will assess the baseline performance of different models through a series of simple tasks. These baseline experiments will help identify a subset of top-performing models, which will then be utilized to evaluate the essential properties of optimizers on LLMs in detail. We will begin by examining basic properties, which are typical of most optimizers, and subsequently move on to advanced properties that only LLMs may possess. (b) Process of evaluating each property. First, we will design a task that can reflect the property in question. Next, we will create a prompt template tailored to this task. Finally, we will employ multiple models to conduct the optimization process, thereby assessing their performance regarding the evaluated property.
Figure 2: The visualization of the landscape of numerical benchmark functions used in our investigation. To enhance clarity, the plotted region differs from the actual bounds utilized during the test. Additionally, the Rosenbrock function is depicted on a logarithmic scale to provide a more insightful representation.
Figure 3: Evaluation results of LLMs' capacity in comprehending string-represented numbers. We manipulate the number of decimal digits in the input to control its numerical precision.
Figure 4: Evaluation results of LLMs' scalability on problem dimensions. We scale the number of dimensions from 16 to 256 using a sphere function.
Figure 5: Evaluation results of LLMs' adaptability to different continuous optimization problems. Performance fluctuates drastically when the given input undergoes a shift. Notably, the tested models exhibit distinct responses to these shifts. The average performance over 5 runs is reported, as described in Section III.B.
...and 4 more figures
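The benchmark functions and input manipulations named in the captions above can be sketched as follows. The sphere and Rosenbrock definitions are the standard textbook forms; the helper names `to_text` and `precision_loss`, the specific digit counts, and the intermediate dimensions between 16 and 256 are assumptions made for illustration and are not taken from the paper.

```python
def sphere(x):
    """Sphere function: f(x) = sum_i x_i^2, global minimum 0 at the origin."""
    return sum(v * v for v in x)


def rosenbrock(x):
    """Rosenbrock function: sum_i 100*(x_{i+1} - x_i^2)^2 + (1 - x_i)^2,
    global minimum 0 at x = (1, ..., 1)."""
    return sum(
        100.0 * (x[i + 1] - x[i] ** 2) ** 2 + (1.0 - x[i]) ** 2
        for i in range(len(x) - 1)
    )


def to_text(x, digits):
    """Render a solution as a string with a fixed number of decimal digits."""
    return ", ".join(f"{v:.{digits}f}" for v in x)


def precision_loss(x, digits, objective):
    """Absolute change in the objective caused by a round trip through text."""
    rounded = [float(tok) for tok in to_text(x, digits).split(", ")]
    return abs(objective(rounded) - objective(x))


if __name__ == "__main__":
    # Assumed doubling schedule; the caption only states the range 16 to 256.
    for dim in (16, 32, 64, 128, 256):
        x = [0.123456789] * dim
        print(dim, len(to_text(x, 8)), precision_loss(x, 2, sphere))
```

The loop prints the prompt length in characters alongside the rounding error, showing how the text needed to describe a single solution grows with the dimension while fewer decimal digits trade precision for shorter input.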
