SciDesignBench: Benchmarking and Improving Language Models for Scientific Inverse Design

David van Dijk; Ivan Vrkic

Abstract

Many of the most important problems in science and engineering are inverse problems: given a desired outcome, find a design that achieves it. Evaluating whether a candidate meets the spec is often routine; a binding energy can be computed, a reactor yield simulated, a pharmacokinetic profile predicted. But searching a combinatorial design space for inputs that satisfy those targets is fundamentally harder. We introduce SciDesignBench, a benchmark of 520 simulator-grounded tasks across 14 scientific domains and five settings spanning single-shot design, short-horizon feedback, long-horizon refinement, and seed-design optimization. On the 10-domain shared-core subset, the best zero-shot model reaches only 29.0% success despite substantially higher parse rates. Simulator feedback helps, but the leaderboard changes with horizon: Sonnet 4.5 is strongest in one-turn de novo design, whereas Opus 4.6 is strongest after 20 turns of simulator-grounded refinement. Providing a starting seed design reshuffles the leaderboard again, demonstrating that constrained modification requires a fundamentally different capability from unconstrained de novo generation. We then introduce RLSF, a simulator-feedback training recipe. An RLSF-tuned 8B model raises single-turn success rates by 8-17 percentage points across three domains. Together, these results position simulator-grounded inverse design as both a benchmark for scientific reasoning and a practical substrate for amortizing expensive test-time compute into model weights.

Paper Structure (55 sections, 3 equations, 5 figures, 5 tables, 1 algorithm)

Introduction
Contributions.
Related Work
Inverse problems and inverse design.
Science benchmarks.
General LLM benchmarks.
Scientific LLMs.
RL with execution or simulator feedback.
Classical and learned inverse design.
SciDesignBench: The Benchmark
Task Formulation
Domains
Oracle Fidelity and Execution Cost
Difficulty Levels
Evaluation Metrics and Protocol
...and 40 more sections

Figures (5)

Figure 1: SciDesignBench overview. (a) Each benchmark task pairs a natural-language design goal with a ground-truth forward simulator. We evaluate frontier LLMs across five settings and train small models via RLSF. (b) Frontier LLMs achieve only 12--26% zero-shot; simulator feedback helps but saturates below 68%. (c) RLSF training improves an 8B model on three case studies: ADMET optimization (30%$\to$41%), PK/PD dosing (24%$\to$36%), and docking (42%$\to$59% with real AutoDock Vina).
Figure 2: Aggregate success rate vs. turn number in the 20-turn long-horizon setting, averaged across the 10 manifest-defined de novo shared-core domains. Simulator feedback provides rapid initial gains (turns 1--5) but exhibits diminishing returns thereafter. Even at 20 turns, models plateau well below full benchmark saturation.
Figure 3: Success rate vs. turn number in the 20-turn long-horizon setting, per domain. Each line represents one model. Perturbation converges quickly; Controls and SSA show sustained gains through 20 turns; domains such as FBA remain highly model-dependent even with longer trajectories.
Figure 4: Feedback utilization is a distinct capability from static knowledge. Zero-shot success rate (x-axis) vs. feedback utilization (20-turn minus zero-shot, y-axis) for each model. Models with similar one-turn performance can have vastly different long-horizon gains: for example, Sonnet 4.6 and Gemini 3.1 Pro start within a few points of each other, but Sonnet 4.6 realizes a much larger improvement under iterative simulator feedback.
Figure 5: RLSF training across three scientific design tasks. (a) ADMET optimization: curriculum GRPO improves from 30% (SFT) to 41% (GRPO), narrowly exceeding the strongest frontier ADMET optimization benchmark result. (b) PK/PD: success improves from 24% to 36% in de novo design and from 32% to 47% in optimization, yielding a competitive optimization result on a coupled dosing-control task. (c) Docking optimization: real AutoDock Vina improves from 42% (SFT) to 59% (GRPO) in the artifact-backed run used in the paper. Dashed vertical lines mark curriculum phase transitions. Points show raw eval; curves are smoothed.
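
The feedback-utilization quantity plotted in Figure 4 is defined in its caption as the 20-turn success rate minus the zero-shot success rate. A minimal sketch of that arithmetic follows. The function name, the model labels, and every number below are hypothetical placeholders chosen for illustration; none of them are results from the paper.

```python
def feedback_utilization(zero_shot: float, twenty_turn: float) -> float:
    """Gain in success rate (percentage points) from iterative simulator feedback.

    Follows the Figure 4 caption: 20-turn success minus zero-shot success.
    """
    return twenty_turn - zero_shot


# Placeholder success rates in percent. These are NOT values from the paper.
placeholder_rates = {
    "model_a": {"zero_shot": 20.0, "twenty_turn": 60.0},
    "model_b": {"zero_shot": 22.0, "twenty_turn": 35.0},
}

utilization = {
    name: feedback_utilization(rates["zero_shot"], rates["twenty_turn"])
    for name, rates in placeholder_rates.items()
}

for name, gain in utilization.items():
    print(f"{name}: {gain:+.1f} percentage points")
```

With these placeholder inputs, the two hypothetical models start within two points of each other in the zero-shot setting but differ widely in how much they gain from feedback, which is the pattern the caption describes.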
