Simple Baselines are Competitive with Code Evolution

Yonatan Gideoni, Sebastian Risi, Yarin Gal

TL;DR

This work tests how well two simple baselines perform across three domains (finding better mathematical bounds, designing agentic scaffolds, and machine learning competitions) and finds that they match or exceed much more sophisticated methods in all three.

Abstract

Code evolution is a family of techniques that rely on large language models to search through possible computer programs by evolving or mutating existing code. Many proposed code evolution pipelines show impressive performance but are often not compared to simpler baselines. We test how well two simple baselines do over three domains: finding better mathematical bounds, designing agentic scaffolds, and machine learning competitions. We find that simple baselines match or exceed much more sophisticated methods in all three. By analyzing these results we find various shortcomings in how code evolution is both developed and used. For the mathematical bounds, a problem's search space and domain knowledge in the prompt are chiefly what dictate a search's performance ceiling and efficiency, with the code evolution pipeline being secondary. Thus, the primary challenge in finding improved bounds is designing good search spaces, which is done by domain experts, and not the search itself. When designing agentic scaffolds we find that high variance in the scaffolds coupled with small datasets leads to suboptimal scaffolds being selected, resulting in hand-designed majority vote scaffolds performing best. We propose better evaluation methods that reduce evaluation stochasticity while keeping the code evolution economically feasible. We finish with a discussion of avenues and best practices to enable more rigorous code evolution in future work.
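
To make the abstract's point about evaluation stochasticity on small datasets concrete, the following is a minimal sketch of how the spread of a majority-vote accuracy could be estimated by bootstrapping from a pool of sampled answers, in the spirit of the procedure summarized in the Figure 4 caption. The function names, parameters, and the synthetic data are assumptions made for illustration; this is not the authors' code, and the toy numbers say nothing about the paper's results.

```python
import random
import statistics
from collections import Counter
from typing import List, Sequence


def majority_vote(answers: Sequence[str]) -> str:
    """Return the most common answer (ties fall to whichever Counter lists first)."""
    return Counter(answers).most_common(1)[0][0]


def bootstrap_accuracies(
    pools: List[List[str]],   # pools[q] = answers previously sampled for question q
    correct: List[str],       # correct[q] = reference answer for question q
    k: int,                   # votes per question (majority vote@k)
    n_evals: int,             # number of passes over the dataset per estimate
    n_boot: int,              # number of bootstrap replicates
    rng: random.Random,
) -> List[float]:
    """Return n_boot accuracies, each averaged over n_evals passes of the dataset."""
    accuracies = []
    for _ in range(n_boot):
        hits = 0
        for _ in range(n_evals):
            for pool, truth in zip(pools, correct):
                votes = rng.choices(pool, k=k)  # resample k answers with replacement
                hits += majority_vote(votes) == truth
        accuracies.append(hits / (n_evals * len(pools)))
    return accuracies


if __name__ == "__main__":
    rng = random.Random(0)
    # Synthetic data (an assumption, NOT the paper's): 30 questions, 100 answers each,
    # with a per-question chance of a correct answer drawn uniformly from [0.2, 0.9].
    pools, correct = [], []
    for _ in range(30):
        p_right = rng.uniform(0.2, 0.9)
        pools.append(
            ["right" if rng.random() < p_right else f"wrong{rng.randrange(3)}"
             for _ in range(100)]
        )
        correct.append("right")
    for n_evals in (1, 3, 10):
        accs = bootstrap_accuracies(pools, correct, k=5, n_evals=n_evals,
                                    n_boot=200, rng=rng)
        print(n_evals, round(statistics.stdev(accs), 4))
```

Comparing the printed standard deviations for different values of `n_evals` gives a rough sense of how much of a measured accuracy difference between two scaffolds could be explained by evaluation noise alone.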

Paper Structure (24 sections, 4 equations, 7 figures, 5 tables)

This paper contains 24 sections, 4 equations, 7 figures, 5 tables.

Figures (7)

  • Figure 1: (a) For AlphaEvolve-style mathematical bounds, each problem effectively defines a function that takes a list of numbers as input, possibly with some extra structure, and outputs a mathematical bound to be maximized or minimized. (b, c) The two baselines: (b) sampling a set of programs IID from an LLM and picking the best one, and (c) generating a set of programs, evaluating them, and generating a new set conditioned on some of those that ran successfully. The number of generated programs shown applies only to the setup used when searching over mathematical bounds.
  • Figure 2: Average probability of matching or exceeding ShinkaEvolve over the 9 problems for our two baselines. Numbers on the right are the probabilities at the maximum budget of \$20. The per-problem breakdown is in Appendix \ref{app:per_prob_p_imp}. Both baselines perform well over different budgets, with sequential conditioned sampling (SCS) generally outperforming ShinkaEvolve. Shaded regions are asymmetric 95% confidence intervals; see Appendix \ref{app:bootstrap} for details.
  • Figure 3: AIME 2024 and 2025 accuracies for different methods. AIME 2024 was used as the validation set and AIME 2025 as the test set. Majority@5/10 denote manually designed majority vote scaffolds. (Left) Validation accuracies are those measured while evolving different scaffolds, with both validation and test accuracies computed from 3 evaluations over the dataset. ShinkaEvolve numbers are from \cite{lange2025shinkaevolve}. IID RS seemingly performs best out of the search methods, although all do worse than majority vote on the test set, with large drops in accuracy. (Right) Results when re-evaluating the scaffolds 10 times, with whiskers denoting 95% confidence intervals. Validation accuracies are lower for all automated search methods, as their scaffolds are seemingly selected more due to stochasticity in the evaluations than to genuinely good performance. Unlike when evaluating only 3 times, here it is apparent that there is no clear difference between ShinkaEvolve and IID RS, with the probability of improvement being $P(\text{Shinka}>\text{IID RS})=0.49$. Using an evaluation cascade with IID RS selects a better scaffold that also generalizes better, being essentially equal to majority vote@10 on the test set (49.5% probability of improvement).
  • Figure 4: Empirical distributions of majority vote@5 accuracies for AIME 2025 when evaluating each question a different number of times. Even when evaluating 10 times there is a standard deviation of more than 1% in the accuracy. Distributions were calculated by sampling 100 answers to each question and bootstrapping.
  • Figure 5: Per-problem probability of matching/exceeding ShinkaEvolve for the two baselines. Shaded regions are 95% confidence intervals.
  • ...and 2 more figures
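
The two baselines described in the Figure 1 caption (b, c) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `generate` and `evaluate` are hypothetical callables standing in for an LLM call and a problem-specific scorer, scores are assumed to be maximized, and picking a random subset of successful programs as the next round's context is one possible reading of "conditioned on some of those that ran successfully".

```python
import random
from typing import Callable, List, Optional, Sequence, Tuple

# Hypothetical interfaces (assumptions, not taken from the paper):
#   generate(context) -> a candidate program; context lists earlier programs to condition on
#   evaluate(program) -> a score to maximize, or None if the program failed to run
Generate = Callable[[Sequence[str]], str]
Evaluate = Callable[[str], Optional[float]]
Scored = Tuple[float, str]


def iid_random_sampling(generate: Generate, evaluate: Evaluate, n: int) -> Optional[Scored]:
    """Baseline (b): draw n programs independently and keep the best one that ran."""
    scored: List[Scored] = []
    for _ in range(n):
        program = generate(())  # empty context, so every draw is independent
        score = evaluate(program)
        if score is not None:
            scored.append((score, program))
    return max(scored) if scored else None


def sequential_conditioned_sampling(
    generate: Generate,
    evaluate: Evaluate,
    rounds: int,
    per_round: int,
    keep: int,
    rng: random.Random,
) -> Optional[Scored]:
    """Baseline (c): generate a set, evaluate it, then condition the next set on
    some of the programs that ran successfully."""
    best: Optional[Scored] = None
    context: List[str] = []
    for _ in range(rounds):
        successes: List[Scored] = []
        for _ in range(per_round):
            program = generate(context)
            score = evaluate(program)
            if score is not None:
                successes.append((score, program))
        if successes:
            round_best = max(successes)
            if best is None or round_best > best:
                best = round_best
            # Assumption: the next context is a random subset of this round's successes.
            chosen = rng.sample(successes, min(keep, len(successes)))
            context = [program for _, program in chosen]
    return best


if __name__ == "__main__":
    rng = random.Random(0)

    def toy_generate(context: Sequence[str]) -> str:
        # Stand-in for an LLM call: perturb the largest number seen in the context.
        base = max((float(p) for p in context), default=0.0)
        return repr(base + rng.uniform(-1.0, 1.0))

    def toy_evaluate(program: str) -> Optional[float]:
        x = float(program)
        if x < -0.5:
            return None  # pretend this "program" crashed
        return -(x - 3.0) ** 2  # toy objective, maximized at x = 3

    print(iid_random_sampling(toy_generate, toy_evaluate, n=20))
    print(sequential_conditioned_sampling(toy_generate, toy_evaluate,
                                          rounds=4, per_round=5, keep=2, rng=rng))
```

Both functions share the same `generate`/`evaluate` interface, so under this sketch they could be compared at an equal budget of generated programs (`n` versus `rounds * per_round`).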