Table of Contents
Fetching ...

Budget-Sensitive Discovery Scoring: A Formally Verified Framework for Evaluating AI-Guided Scientific Selection

Abhinaba Basu, Pavan Chakraborty

Abstract

Scientific discovery increasingly relies on AI systems to select candidates for expensive experimental validation, yet no principled, budget-aware evaluation framework exists for comparing selection strategies -- a gap intensified by large language models (LLMs), which generate plausible scientific proposals without reliable downstream evaluation. We introduce the Budget-Sensitive Discovery Score (BSDS), a formally verified metric -- 20 theorems machine-checked by the Lean 4 proof assistant -- that jointly penalizes false discoveries (lambda-weighted FDR) and excessive abstention (gamma-weighted coverage gap) at each budget level. Its budget-averaged form, the Discovery Quality Score (DQS), provides a single summary statistic that no proposer can inflate by performing well at a cherry-picked budget. As a case study, we apply BSDS/DQS to: do LLMs add marginal value to an existing ML pipeline for drug discovery candidate selection? We evaluate 39 proposers -- 11 mechanistic variants, 14 zero-shot LLM configurations, and 14 few-shot LLM configurations -- using SMILES representations on MoleculeNet HIV (41,127 compounds, 3.5% active, 1,000 bootstrap replicates) under both random and scaffold splits. Three findings emerge. First, the simple RF-based Greedy-ML proposer achieves the best DQS (-0.046), outperforming all MLP variants and LLM configurations. Second, no LLM surpasses the Greedy-ML baseline under zero-shot or few-shot evaluation on HIV or Tox21, establishing that LLMs provide no marginal value over an existing trained classifier. Third, the proposer hierarchy generalizes across five MoleculeNet benchmarks spanning 0.18%-46.2% prevalence, a non-drug AV safety domain, and a 9x7 grid of penalty parameters (tau >= 0.636, mean tau = 0.863). The framework applies to any setting where candidates are selected under budget constraints and asymmetric error costs.

Budget-Sensitive Discovery Scoring: A Formally Verified Framework for Evaluating AI-Guided Scientific Selection

Abstract

Scientific discovery increasingly relies on AI systems to select candidates for expensive experimental validation, yet no principled, budget-aware evaluation framework exists for comparing selection strategies -- a gap intensified by large language models (LLMs), which generate plausible scientific proposals without reliable downstream evaluation. We introduce the Budget-Sensitive Discovery Score (BSDS), a formally verified metric -- 20 theorems machine-checked by the Lean 4 proof assistant -- that jointly penalizes false discoveries (lambda-weighted FDR) and excessive abstention (gamma-weighted coverage gap) at each budget level. Its budget-averaged form, the Discovery Quality Score (DQS), provides a single summary statistic that no proposer can inflate by performing well at a cherry-picked budget. As a case study, we apply BSDS/DQS to: do LLMs add marginal value to an existing ML pipeline for drug discovery candidate selection? We evaluate 39 proposers -- 11 mechanistic variants, 14 zero-shot LLM configurations, and 14 few-shot LLM configurations -- using SMILES representations on MoleculeNet HIV (41,127 compounds, 3.5% active, 1,000 bootstrap replicates) under both random and scaffold splits. Three findings emerge. First, the simple RF-based Greedy-ML proposer achieves the best DQS (-0.046), outperforming all MLP variants and LLM configurations. Second, no LLM surpasses the Greedy-ML baseline under zero-shot or few-shot evaluation on HIV or Tox21, establishing that LLMs provide no marginal value over an existing trained classifier. Third, the proposer hierarchy generalizes across five MoleculeNet benchmarks spanning 0.18%-46.2% prevalence, a non-drug AV safety domain, and a 9x7 grid of penalty parameters (tau >= 0.636, mean tau = 0.863). The framework applies to any setting where candidates are selected under budget constraints and asymmetric error costs.
Paper Structure (93 sections, 4 equations, 7 figures, 13 tables)

This paper contains 93 sections, 4 equations, 7 figures, 13 tables.

Figures (7)

  • Figure 1: Evaluation pipeline for 39 proposers across five categories: baselines (2), mechanistic ablations (5), direct optimization (BSDS-Recursive), ablation controls (3), and real LLMs (28). All proposers are evaluated by the formally verified $\mathrm{BSDS}$/$\mathrm{DQS}$ metric with 1,000 bootstrap replicates. Multi-round proposers receive evaluation feedback (dashed arrow).
  • Figure 2: $\mathrm{BSDS}$ as a function of budget fraction $B/N$ for all 11 mechanistic proposers (2 baselines, 5 ablations, BSDS-Recursive, and 3 ablation controls), with 95% bootstrap confidence bands. Greedy-ML dominates all proposers across budget levels.
  • Figure 3: Bootstrap distributions of $\mathrm{DQS}$ for all mechanistic proposers (1,000 replicates). Greedy-ML has the lowest variance; Retrieval achieves the most favorable individual budget-level scores.
  • Figure 4: $\mathrm{DQS}$ as a function of softmax temperature $\tau$ for the Generative proposer. Performance degrades monotonically with increasing $\tau$; even the optimal $\tau^* = 0.1$ yields negative $\mathrm{DQS}$.
  • Figure 5: NL framing invariance. Dots show mean $\mathrm{DQS}$ across six requirement phrasings; horizontal bars show the min--max spread. All proposers except Informed-Prior exhibit zero spread ($\Delta < 0.01$), confirming that $\mathrm{DQS}$ is invariant to NL framing.
  • ...and 2 more figures

Theorems & Definitions (2)

  • Definition 1: Budget-Sensitive Discovery Score
  • Definition 2: Discovery Quality Score