Steering an Active Learning Workflow Towards Novel Materials Discovery via Queue Prioritization
Marcus Schwarting, Logan Ward, Nathaniel Hudson, Xiaoli Yan, Ben Blaiszik, Santanu Chaudhuri, Eliu Huerta, Ian Foster
TL;DR
This work tackles the inefficiency and potential degradation of generative AI in inverse design by coupling a queue prioritization scheme with an active-learning surrogate to steer candidate selection. The approach retrains a surrogate predictor on new simulation data and uses an acquisition function to reorder the generation queue, promoting high-quality, novel MOF candidates and deprioritizing implausible ones. Applied to MOF-based carbon capture, the method increases the average number of high-performing candidates from 281 to 604 per 1000 novel candidates generated, with only marginal compute overhead (≈0.6%). The results demonstrate an improved exploration–exploitation balance and offer a practical, generalizable strategy to boost discovery in complex design spaces while mitigating model decay. This has direct implications for accelerating materials discovery in high-throughput screening pipelines where expensive evaluations constrain exploration.
Abstract
Generative AI poses both opportunities and risks for solving inverse design problems in the sciences. Generative tools can expand and refine a search space autonomously, but at the cost of exploring low-quality regions until they are sufficiently fine-tuned. Here, we propose a queue prioritization algorithm that combines generative modeling and active learning in the context of a distributed workflow for exploring complex design spaces. We find that incorporating an active learning model to prioritize top design candidates can prevent a generative AI workflow from expending resources on nonsensical candidates and halt potential generative model decay. For an existing generative AI workflow for discovering novel molecular structure candidates for carbon capture, our active learning approach significantly increases the number of high-quality candidates identified by the generative model. We find that, out of 1000 novel candidates, our workflow without active learning generates an average of 281 high-performing candidates, while our proposed prioritization with active learning generates an average of 604 high-performing candidates.
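The prioritization loop described above can be illustrated with a minimal sketch. This is not the authors' implementation: the toy `simulate` function, the nearest-neighbor surrogate, the upper-confidence-style acquisition score, and all names and parameters (`kappa`, `budget`, `batch`) are assumptions chosen only to show the general pattern of retraining a surrogate on new results and reordering the pending queue.

```python
import math
import random


def simulate(x):
    """Hypothetical stand-in for an expensive simulation; higher is better."""
    return -((x[0] - 0.7) ** 2) - ((x[1] - 0.3) ** 2)


class NearestNeighborSurrogate:
    """Toy surrogate (an assumption): reuse the score of the nearest evaluated point."""

    def __init__(self):
        self.data = []  # list of (features, score) pairs

    def fit(self, data):
        self.data = list(data)

    def predict(self, x):
        # Returns (predicted score, distance to the nearest training point).
        # The distance serves as a crude proxy for uncertainty.
        if not self.data:
            return 0.0, 1.0
        feats, score = min(self.data, key=lambda d: math.dist(d[0], x))
        return score, math.dist(feats, x)


def acquisition(surrogate, x, kappa=0.5):
    """Upper-confidence-style score: predicted value plus an exploration bonus."""
    mean, dist = surrogate.predict(x)
    return mean + kappa * dist


def prioritize(queue, surrogate, kappa=0.5):
    """Reorder pending candidates so the most promising are evaluated first."""
    return sorted(queue, key=lambda x: acquisition(surrogate, x, kappa), reverse=True)


def run(n_candidates=200, budget=40, batch=5, seed=0, use_surrogate=True):
    rng = random.Random(seed)
    # Random points stand in for candidates emitted by a generative model.
    queue = [(rng.random(), rng.random()) for _ in range(n_candidates)]
    surrogate = NearestNeighborSurrogate()
    evaluated = []
    while len(evaluated) < budget and queue:
        if use_surrogate:
            queue = prioritize(queue, surrogate)
        head, queue = queue[:batch], queue[batch:]
        evaluated.extend((x, simulate(x)) for x in head)
        surrogate.fit(evaluated)  # retrain on all simulation results so far
    return evaluated
```

Calling `run(use_surrogate=False)` evaluates candidates in arrival order, which gives a baseline for comparison with the prioritized queue. In a real workflow the surrogate would be a learned property predictor and the simulation a physics-based evaluation; the queue-reordering step is the only part this sketch is meant to convey.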
