Code World Models for Parameter Control in Evolutionary Algorithms

Camilo Chacón Sartori; Guillem Rodríguez Corominas

TL;DR

This work extends Code World Models (CWMs), LLM-synthesized Python programs that predict environment dynamics, to parameter control in evolutionary algorithms. The LLM synthesizes a simulator of the optimizer's dynamics, and greedy planning over that simulator (CWM-greedy) outperforms DQN in sample efficiency, success rate, and generalization.

Abstract

Can an LLM learn how an optimizer behaves -- and use that knowledge to control it? We extend Code World Models (CWMs), LLM-synthesized Python programs that predict environment dynamics, from deterministic games to stochastic combinatorial optimization. Given suboptimal trajectories of $(1{+}1)$-$\text{RLS}_k$, the LLM synthesizes a simulator of the optimizer's dynamics; greedy planning over this simulator then selects the mutation strength $k$ at each step. On LeadingOnes and OneMax, CWM-greedy performs within 6\% of the theoretically optimal policy -- without ever seeing optimal-policy trajectories. On Jump$_k$, where a deceptive valley causes all adaptive baselines to fail (0\% success rate), CWM-greedy achieves 100\% success rate -- without any collection policy using oracle knowledge of the gap parameter. On the NK-Landscape, where no closed-form model exists, CWM-greedy outperforms all baselines across fifteen independently generated instances ($36.94$ vs.\ $36.32$; $p<0.001$) when the prompt includes empirical transition statistics. The CWM also outperforms DQN in sample efficiency (200 offline trajectories vs.\ 500 online episodes), success rate (100\% vs.\ 58\%), and generalization ($k{=}3$: 78\% vs.\ 0\%). Robustness experiments confirm stable synthesis across 5 independent runs.
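To make the controlled optimizer concrete, the following is a minimal sketch of $(1{+}1)$-$\text{RLS}_k$ as it is commonly defined: flip exactly $k$ distinct bits chosen uniformly at random and keep the offspring if it is not worse. It is run here on LeadingOnes. The function names, the tie-accepting selection rule, and the `policy(fitness, n)` callback are assumptions made for illustration and are not taken from the authors' code.

```python
import random


def leading_ones(x):
    """Number of consecutive 1-bits at the start of the bit string x."""
    count = 0
    for bit in x:
        if bit != 1:
            break
        count += 1
    return count


def rls_k_step(x, k, fitness, rng):
    """One (1+1)-RLS_k step: flip exactly k distinct bits of a copy of x.

    The offspring replaces the parent if its fitness is not worse
    (accepting ties is an assumption of this sketch).
    """
    child = list(x)
    for idx in rng.sample(range(len(x)), k):
        child[idx] ^= 1
    return child if fitness(child) >= fitness(x) else x


def run(n, policy, fitness, max_steps, seed=0):
    """Run RLS_k with mutation strength k = policy(current_fitness, n).

    Assumes the optimum has fitness n (true for LeadingOnes and OneMax).
    Returns the number of steps taken before the optimum was observed,
    or None if it was not observed within max_steps iterations.
    """
    rng = random.Random(seed)
    x = [rng.randint(0, 1) for _ in range(n)]
    for step in range(max_steps):
        current = fitness(x)
        if current == n:
            return step
        x = rls_k_step(x, policy(current, n), fitness, rng)
    return None
```

A parameter-control policy plugs in through `policy`: for example, `run(50, lambda f, n: max(1, n // (f + 1)), leading_ones, max_steps=100_000)` applies the LeadingOnes schedule $k^*(i) = \lfloor n/(i+1) \rfloor$ quoted in the Figure 2 caption, while `lambda f, n: 1` gives the static $k{=}1$ baseline.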

Paper Structure (21 sections, 7 figures, 9 tables, 1 algorithm)

Introduction
Related work.
Method: Applying CWMs to Parameter Control
Experimental Setup
Results
Unimodal Benchmarks: LeadingOnes & OneMax
Jump$_k$: The Deceptive Valley
Why adaptive baselines fail.
Why CWM-greedy succeeds.
Comparison with DQN
NK-Landscape: Beyond Mathematical Models
Generalization
Across problem sizes.
Across Jump$_k$ values.
CWM Quality
...and 6 more sections

Figures (7)

Figure 1: Overview of the CWM approach. Offline: trajectory data and a problem description are assembled into a prompt; an LLM synthesizes a Python world model in a single API call (${\sim}$€ 0.04). Online: the greedy planner queries the CWM at each step to select $k^*$. Bottom: key results across four benchmarks.
Figure 2: LeadingOnes: CWM score heatmap over (fitness $i$, parameter $k$). Black stars ($\star$) mark the optimal $k^*(i) = \lfloor n/(i+1) \rfloor$, snapped to the nearest column when $k^*$ is not in the displayed set. Stars appear only at $k$ values that are optimal for at least one fitness level; columns without stars (e.g. $k{=}20, 25, 30, 40$) are never greedy-optimal for any $i$.
Figure 3: Jump$_k$ ($k{=}2$): CWM score heatmap over (fitness, parameter $k$). Black stars ($\star$) mark the greedy-optimal $k^*(i)$, snapped to the nearest displayed column. At the valley edge (fitness $= n$), the CWM correctly predicts that only $k{=}2$ leads to improvement---opposite to all adaptive baselines, which decrease $k$ during stagnation.
Figure 4: Left: CWM vs. DQN on Jump$_k$: 100% success rate vs. 58%. Right: DQN learning curves ($k{=}2$). Training beyond 500 episodes degrades performance---success rate plateaus at ${\sim}50\%$ while steps decrease only marginally, revealing overfitting to the $\epsilon$-greedy exploration policy.
Figure 5: NK-Landscape synthesis (simplified). (A) The prompt provides an empirical transition table instead of a mathematical model. (B) The LLM encodes this table into a Python CWM; the greedy planner selects $k^* = \arg\max_k \texttt{evaluate}(\texttt{predict}(s, k))$ at each step.
...and 2 more figures
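
The planning rule in the Figure 5 caption, $k^* = \arg\max_k \texttt{evaluate}(\texttt{predict}(s, k))$, can be illustrated with a hand-written stand-in for the world model. The sketch below is not the LLM-synthesized CWM: `predict` is a toy model for LeadingOnes that returns the exact probability that flipping $k$ distinct bits improves a string with $i$ leading ones, and `evaluate` simply passes that score through. Both signatures, and the choice of improvement probability as the score, are assumptions for illustration.

```python
from fractions import Fraction
from math import comb


def predict(i, k, n):
    """Toy world model for LeadingOnes (hand-written, hypothetical).

    Flipping exactly k distinct bits improves a string with i leading ones
    only if the first 0-bit is flipped and none of the i leading 1-bits is.
    That happens for comb(n - i - 1, k - 1) of the comb(n, k) possible flips.
    """
    if i >= n:
        return Fraction(0)
    return Fraction(comb(n - i - 1, k - 1), comb(n, k))


def evaluate(prediction):
    """Score a prediction; here the score is the improvement probability."""
    return prediction


def greedy_k(i, n, candidates=None):
    """Greedy planner: pick the k whose predicted outcome scores highest.

    Ties go to the smallest k, because max() keeps the first maximum
    and the candidates are visited in increasing order.
    """
    if candidates is None:
        candidates = range(1, n + 1)
    return max(candidates, key=lambda k: evaluate(predict(i, k, n)))
```

Under this toy model the greedy choice coincides with the schedule $k^*(i) = \lfloor n/(i+1) \rfloor$ from the Figure 2 caption; for instance, `greedy_k(0, 20)` returns `20` and `greedy_k(19, 20)` returns `1`. Exact `Fraction` arithmetic is used so that ties between neighbouring values of $k$ are resolved deterministically and not by floating-point rounding.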
