MOOSE-Chem3: Toward Experiment-Guided Hypothesis Ranking via Simulated Experimental Feedback

Wanhao Liu, Zonglin Yang, Jue Wang, Lidong Bing, Di Zhang, Dongzhan Zhou, Yuqiang Li, Houqiang Li, Erik Cambria, Wanli Ouyang

TL;DR

This work addresses the lack of scalable experimental feedback for hypothesis ranking in natural-science domains where experiments are costly. It introduces CSX-Sim, a domain-grounded simulator built on three conceptual foundations (A1, P1, D1) that models hypothesis performance as a function of distance to a hidden ground truth, and CSX-Rank, an in-context reinforcement learning framework that clusters the functional components of hypotheses to prioritize which ones to test under a limited trial budget. The authors validate the simulator against 124 hypotheses with experimentally reported outcomes, demonstrating strong trend alignment and robustness, and show that CSX-Rank substantially outperforms pre-experiment baselines and ablations on TOMATO-chem data. The integrated CSX toolkit enables systematic study of experiment-guided ranking and serves as a strong proof of concept for feedback-informed discovery. Overall, the approach aims to reduce wet-lab costs, accelerate materials and drug discovery, and provide interpretable, auditable decision pipelines for scientific inquiry.

Abstract

Hypothesis ranking is vital for automated scientific discovery, especially in cost-intensive, throughput-limited natural science domains. Current methods focus on pre-experiment ranking, relying solely on language model reasoning without empirical feedback. We introduce experiment-guided ranking, which prioritizes hypotheses based on feedback from prior tests. Due to the impracticality of real experiments, we propose a simulator grounded in domain-specific concepts that models hypothesis performance as a function of similarity to a hidden ground truth, perturbed by noise. Validated against 124 hypotheses with experimentally reported outcomes, the simulator approximates real results with consistent trend alignment. Although deviations exist, they mimic wet-lab noise, promoting more robust ranking strategies. We frame experiment-guided ranking as a sequential decision-making problem and propose an in-context reinforcement learning (ICRL) framework. Our LLM-based policy decomposes hypotheses into functional elements, clusters them by mechanistic roles, and prioritizes recombinations based on feedback. Experiments show our approach significantly outperforms pre-experiment baselines and strong ablations. Our toolkit, comprising the simulator and ICRL framework, enables systematic research on experiment-guided ranking, with the policy serving as a strong proof of concept.
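
To make the setup in the abstract concrete, here is a minimal Python sketch of the two ideas: a simulator that scores a hypothesis by its similarity to a hidden ground truth plus noise, and a sequential loop that uses earlier feedback to choose the next hypothesis to test. It is an illustration under stated assumptions, not the authors' simulator or policy. Hypotheses are represented as tuples of component names, similarity is taken to be Jaccard overlap, noise is Gaussian, and the LLM-based policy is replaced by a simple heuristic that averages the scores seen so far for each component. All function names and parameters are hypothetical.

```python
import random


def jaccard(a, b):
    """Jaccard overlap between two collections of component names."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def simulate_experiment(hypothesis, ground_truth, noise_std=0.05, rng=None):
    """Assumed simulator: similarity to the hidden ground truth plus Gaussian noise."""
    rng = rng or random.Random()
    score = jaccard(hypothesis, ground_truth) + rng.gauss(0.0, noise_std)
    return min(1.0, max(0.0, score))  # clip to [0, 1]


def feedback_guided_search(candidates, ground_truth, budget, noise_std=0.05, seed=0):
    """Test up to `budget` hypotheses, choosing each one from prior feedback.

    Stand-in heuristic (an assumption, not the paper's LLM policy): a component's
    value is the mean score of the tested hypotheses that contained it, and an
    untested hypothesis is ranked by the mean value of its known components.
    """
    rng = random.Random(seed)
    remaining = list(candidates)
    component_scores = {}  # component name -> list of observed scores
    history = []           # (hypothesis, simulated score) in test order

    def priority(hypothesis):
        known = [
            sum(component_scores[c]) / len(component_scores[c])
            for c in hypothesis
            if c in component_scores
        ]
        # Neutral prior of 0.5 when no component has been observed yet.
        return sum(known) / len(known) if known else 0.5

    for _ in range(min(budget, len(remaining))):
        best = max(remaining, key=priority)
        remaining.remove(best)
        result = simulate_experiment(best, ground_truth, noise_std, rng)
        history.append((best, result))
        for component in best:
            component_scores.setdefault(component, []).append(result)
    return history


if __name__ == "__main__":
    # Hypothetical component names, used only for illustration.
    truth = ("catalyst_A", "solvent_X", "ligand_L")
    pool = [
        ("catalyst_B", "solvent_Y"),
        ("catalyst_A", "solvent_Y"),
        ("catalyst_A", "solvent_X"),
        ("catalyst_A", "solvent_X", "ligand_L"),
    ]
    for hypothesis, score in feedback_guided_search(pool, truth, budget=3):
        print(hypothesis, round(score, 3))
```

A pre-experiment ranker would fix the order of `pool` once and never revise it. The loop above instead recomputes priorities after every simulated result, which is the stateful behavior the abstract refers to as experiment-guided ranking.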

Paper Structure

This paper contains 39 sections, 12 equations, 6 figures, 10 tables.

Figures (6)

  • Figure 1: Overview of ranking strategies. Pre-experiment ranking is stateless and ignores feedback. Experiment-guided ranking with real experiments is stateful but infeasible to scale. Our simulator enables efficient development of ranking policies through simulated feedback before real deployment.
  • Figure 2: Illustration of the three conceptual foundations (A1–P1–D1) for simulator construction.
  • Figure 3: The internal structure of the simulator.
  • Figure 4: Experiment-guided ranking policy within an in-context reinforcement learning framework.
  • Figure 5: A Framework for Extracting Chemical Components in the Simulator.
  • ...and 1 more figure