Greedy Is a Strong Default: Agents as Iterative Optimizers

Yitao Li

Abstract

Classical optimization algorithms--hill climbing, simulated annealing, population-based methods--generate candidate solutions via random perturbations. We replace the random proposal generator with an LLM agent that reasons about evaluation diagnostics to propose informed candidates, and ask: does the classical optimization machinery still help when the proposer is no longer random? We evaluate on four tasks spanning discrete, mixed, and continuous search spaces (all replicated across 3 independent runs): rule-based classification on Breast Cancer (test accuracy 86.0% to 96.5%), mixed hyperparameter optimization for MobileNetV3-Small on STL-10 (84.5% to 85.8%, zero catastrophic failures vs. 60% for random search), LoRA fine-tuning of Qwen2.5-0.5B on SST-2 (89.5% to 92.7%, matching Optuna TPE with 2x efficiency), and XGBoost on Adult Census (AUC 0.9297 to 0.9317, tying CMA-ES with 3x fewer evaluations). Empirically, on these tasks, a cross-task ablation shows that simulated annealing, parallel investigators, and even a second LLM (OpenAI Codex) provide no benefit over greedy hill climbing while requiring 2-3x more evaluations. In our setting, the LLM's learned prior appears strong enough that acceptance-rule sophistication has limited impact--round 1 alone delivers the majority of improvement, and variants converge to similar configurations across strategies. The practical implication is surprising simplicity: greedy hill climbing with early stopping is a strong default. Beyond accuracy, the framework produces human-interpretable artifacts--the discovered cancer classification rules independently recapitulate established cytopathology principles.
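
For concreteness, the loop the abstract describes (greedy acceptance with an agent in place of the random perturbation, plus early stopping) can be sketched in a few lines of Python. This is a minimal illustration under assumed interfaces, not the authors' implementation: propose stands for the agent call that reasons over evaluation diagnostics, evaluate returns a validation score with diagnostics, and max_rounds and patience are placeholder names.

def greedy_llm_hill_climb(init_cfg, propose, evaluate, max_rounds=10, patience=3):
    """Greedy hill climbing with an LLM proposer and early stopping (sketch)."""
    best_cfg = init_cfg
    best_score, diag = evaluate(best_cfg)       # (validation score, diagnostics)
    stale = 0
    for _ in range(max_rounds):
        candidate = propose(best_cfg, diag)     # agent reasons over diagnostics
        score, cand_diag = evaluate(candidate)
        if score > best_score:                  # greedy: accept strict improvements only
            best_cfg, best_score, diag = candidate, score, cand_diag
            stale = 0
        else:                                   # no annealing, no population: reject
            stale += 1
        if stale >= patience:                   # early stopping
            break
    return best_cfg, best_score

A simulated-annealing variant would differ only in the rejection branch, accepting worse candidates with a temperature-dependent probability; the cross-task ablation finds this adds no benefit over the greedy rule above.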

Figures (3)

  • Figure 1: Agentic rule-based classifier for breast cancer diagnosis. Rules are evaluated sequentially (top to bottom); the first matching rule determines the classification. The system discovered 5 rules using 4 features from FNA biopsy measurements, achieving 95.6% test accuracy with 341 training samples---comparable to a pruned decision tree (96%) trained on 455 samples. Italic annotations are clinical interpretations based on the FNA cytopathology literature; they were not provided to the agent---the rules were discovered from data alone. (A minimal code sketch of this first-match evaluation follows the figure list.)
  • Figure 2: Best validation metric vs. optimization round across all four tasks (3 independent runs each). Round 1 delivers the majority of improvement in every case. Subsequent rounds add diminishing gains before early stopping. Cancer panel shows greedy (blue) vs. SA (orange); SA Run 2 achieved high validation but lower test accuracy (0.956 vs. greedy best 0.965), illustrating that validation-level advantage does not transfer to test.
  • Figure 3: Per-trial test accuracy on STL-10 for random search, SMAC, and agentic greedy. Random search and SMAC produce catastrophic failures (test accuracy < 0.15) in 60% and 33% of trials. The agentic method produces zero failures across all evaluations, with tight clustering near the optimum.
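
The first-match evaluation described in Figure 1 amounts to an ordered rule list. The toy sketch below illustrates the mechanism only; the feature names and thresholds are hypothetical, not the discovered rules.

# Ordered rule list: rules are checked top to bottom, and the first
# predicate that fires determines the classification (as in Figure 1).
RULES = [
    (lambda x: x["bare_nuclei"] >= 7, "malignant"),      # hypothetical threshold
    (lambda x: x["clump_thickness"] >= 8, "malignant"),  # hypothetical threshold
    (lambda x: x["uniformity"] <= 2, "benign"),          # hypothetical threshold
]

def classify(sample, rules=RULES, default="benign"):
    for predicate, label in rules:
        if predicate(sample):
            return label
    return default  # fallback when no rule matches

print(classify({"bare_nuclei": 9, "clump_thickness": 3, "uniformity": 5}))  # -> malignant

Because classification is just a top-to-bottom scan of explicit predicates, each prediction can be traced to the single rule that fired, which is what makes the discovered rules auditable against the cytopathology literature.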