Code World Models for Parameter Control in Evolutionary Algorithms

Camilo Chacón Sartori, Guillem Rodríguez Corominas

TL;DR

This work extends Code World Models (CWMs), LLM-synthesized Python programs that predict environment dynamics, to parameter control in evolutionary algorithms: the LLM synthesizes a simulator of the optimizer's dynamics, and greedy planning over that simulator (CWM-greedy) outperforms DQN in sample efficiency, success rate, and generalization.

Abstract

Can an LLM learn how an optimizer behaves -- and use that knowledge to control it? We extend Code World Models (CWMs), LLM-synthesized Python programs that predict environment dynamics, from deterministic games to stochastic combinatorial optimization. Given suboptimal trajectories of $(1{+}1)$-$\text{RLS}_k$, the LLM synthesizes a simulator of the optimizer's dynamics; greedy planning over this simulator then selects the mutation strength $k$ at each step. On LeadingOnes and OneMax, CWM-greedy performs within 6% of the theoretically optimal policy -- without ever seeing optimal-policy trajectories. On Jump$_k$, where a deceptive valley causes all adaptive baselines to fail (0% success rate), CWM-greedy achieves 100% success rate -- without any collection policy using oracle knowledge of the gap parameter. On the NK-Landscape, where no closed-form model exists, CWM-greedy outperforms all baselines across fifteen independently generated instances ($36.94$ vs. $36.32$; $p<0.001$) when the prompt includes empirical transition statistics. The CWM also outperforms DQN in sample efficiency (200 offline trajectories vs. 500 online episodes), success rate (100% vs. 58%), and generalization ($k{=}3$: 78% vs. 0%). Robustness experiments confirm stable synthesis across 5 independent runs.
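To make the controlled optimizer concrete, the following is a minimal sketch of $(1{+}1)$-$\text{RLS}_k$ on LeadingOnes with a pluggable parameter-control policy. It is an illustration under stated assumptions, not the authors' code: the names `leading_ones`, `rls_k_step`, `run`, and the `policy(i, n)` signature are hypothetical, and the acceptance rule (keep the offspring if it is not worse) is the usual textbook convention rather than something the abstract specifies.

```python
import random


def leading_ones(x):
    """Number of consecutive 1-bits at the start of the bit string x."""
    count = 0
    for bit in x:
        if bit != 1:
            break
        count += 1
    return count


def rls_k_step(x, k, fitness, rng):
    """One (1+1)-RLS_k step: flip exactly k distinct, uniformly chosen bits.

    Assumption: the offspring replaces the parent if it is not worse.
    """
    y = list(x)
    for pos in rng.sample(range(len(x)), k):
        y[pos] = 1 - y[pos]
    return y if fitness(y) >= fitness(x) else x


def run(n, policy, fitness=leading_ones, max_steps=100_000, seed=0):
    """Run until the optimum (fitness n) is reached; return the steps used.

    `policy(i, n)` is a hypothetical hook that maps the current fitness i
    to a mutation strength k; a CWM-based planner would plug in here.
    """
    rng = random.Random(seed)
    x = [rng.randint(0, 1) for _ in range(n)]
    for step in range(max_steps):
        if fitness(x) == n:
            return step
        k = policy(fitness(x), n)
        x = rls_k_step(x, k, fitness, rng)
    return max_steps


# Static baseline: always flip a single bit.
steps = run(10, lambda i, n: 1)
```

Swapping the constant policy for one that queries a world model at each step gives the control loop the abstract describes.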

Paper Structure (21 sections, 7 figures, 9 tables, 1 algorithm)

Figures (7)

  • Figure 1: Overview of the CWM approach. Offline: trajectory data and a problem description are assembled into a prompt; an LLM synthesizes a Python world model in a single API call (${\sim}$€ 0.04). Online: the greedy planner queries the CWM at each step to select $k^*$. Bottom: key results across four benchmarks.
  • Figure 2: LeadingOnes: CWM score heatmap over (fitness $i$, parameter $k$). Black stars ($\star$) mark the optimal $k^*(i) = \lfloor n/(i+1) \rfloor$, snapped to the nearest column when $k^*$ is not in the displayed set. Stars appear only at $k$ values that are optimal for at least one fitness level; columns without stars (e.g. $k{=}20, 25, 30, 40$) are never greedy-optimal for any $i$.
  • Figure 3: Jump$_k$ ($k{=}2$): CWM score heatmap over (fitness, parameter $k$). Black stars ($\star$) mark the greedy-optimal $k^*(i)$, snapped to the nearest displayed column. At the valley edge (fitness $= n$), the CWM correctly predicts that only $k{=}2$ leads to improvement, the opposite of all adaptive baselines, which decrease $k$ during stagnation.
  • Figure 4: Left: CWM vs. DQN on Jump$_k$: 100% success rate vs. 58%. Right: DQN learning curves ($k{=}2$). Training beyond 500 episodes degrades performance: success rate plateaus at ${\sim}50\%$ while steps decrease only marginally, revealing overfitting to the $\epsilon$-greedy exploration policy.
  • Figure 5: NK-Landscape synthesis (simplified). (A) The prompt provides an empirical transition table instead of a mathematical model. (B) The LLM encodes this table into a Python CWM; the greedy planner selects $k^* = \arg\max_k \texttt{evaluate}(\texttt{predict}(s, k))$ at each step.
  • ...and 2 more figures
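
The planner rule in the Figure 5 caption, $k^* = \arg\max_k \texttt{evaluate}(\texttt{predict}(s, k))$, and the LeadingOnes optimum in the Figure 2 caption, $k^*(i) = \lfloor n/(i+1) \rfloor$, can be connected with a small sketch. Everything here is an assumption for illustration: `predict` is a hand-written stand-in for the LLM-synthesized CWM (it returns the exact one-step improvement probability on LeadingOnes rather than a learned simulation), `evaluate` is taken to be the identity, and ties are broken toward the smallest $k$.

```python
from fractions import Fraction
from math import comb


def predict(i, k, n):
    """Stand-in world model for LeadingOnes (requires 0 <= i < n).

    Exact probability that flipping k distinct, uniformly chosen bits
    improves fitness i: the first 0-bit must be flipped and none of the
    i leading 1-bits may be, leaving k-1 flips among n-i-1 positions.
    """
    return Fraction(comb(n - i - 1, k - 1), comb(n, k))


def evaluate(prediction):
    """Score a prediction; here simply the improvement probability."""
    return prediction


def greedy_k(i, n, candidates=None):
    """Greedy planner: the k that maximizes evaluate(predict(i, k, n)).

    max() returns the first maximal element, so ties go to the smallest k.
    """
    ks = candidates if candidates is not None else range(1, n + 1)
    return max(ks, key=lambda k: evaluate(predict(i, k, n)))


n = 20
# Under this stand-in model the greedy choice matches floor(n / (i + 1)).
matches = all(greedy_k(i, n) == n // (i + 1) for i in range(n))
```

Exact rational arithmetic is used so that ties between adjacent values of $k$ are resolved deterministically rather than by floating-point rounding.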