Closed-Form Test Functions for Biophysical Sequence Optimization Algorithms
Samuel Stanton, Robert Alberstein, Nathan Frey, Andrew Watkins, Kyunghyun Cho
TL;DR
The paper addresses the lack of accessible benchmarks for biophysical sequence optimization by introducing Ehrlich functions, a closed-form, tunable benchmark family that captures core geometric features such as feasibility constraints and non-additive motif interactions. It argues that well-designed benchmarks should be low-cost, well-characterized, and challenging while remaining relevant to real applications, and it situates Ehrlich functions alongside other benchmark types. Using a simple genetic algorithm, the authors demonstrate how Ehrlich parameters control difficulty and how optimizer hyperparameters influence performance, including hyperparameter tuning with Bayesian optimization that yields gains with trade-offs in feasibility. The work offers a practical, reproducible framework for rapid, scalable evaluation of optimization algorithms in biophysical design, with implications for deeper exploration of generative search strategies and future benchmark development.
Abstract
There is a growing body of work seeking to extend the success of machine learning (ML) in domains like computer vision (CV) and natural language processing (NLP) to applications involving biophysical data. One of the key ingredients of prior successes in CV and NLP was the broad acceptance of difficult benchmarks that distilled key subproblems into approachable tasks that any junior researcher could investigate, but good benchmarks for biophysical domains are rare. This scarcity is partially due to a narrow focus on benchmarks that simulate biophysical data; we propose instead to carefully abstract biophysical problems into simpler ones with key geometric similarities. In particular, we propose a new class of closed-form test functions for biophysical sequence optimization, which we call Ehrlich functions. We provide empirical results demonstrating that these functions are interesting objects of study and can be non-trivial to solve with a standard genetic optimization baseline.
