CodeDistiller: Automatically Generating Code Libraries for Scientific Coding Agents
Peter Jansen, Samiah Hassan, Pragnya Narasimha
TL;DR
CodeDistiller addresses the limitation that automated scientific discovery agents rely too heavily on parametric knowledge by automatically distilling large collections of GitHub repositories into vetted, working domain-specific code examples. The system identifies each repository's purpose, selects relevant files, and uses LLMs to generate and iteratively debug executable code; it is evaluated in a materials-science setting with multiple base models. Across 250 repositories, distillation success reaches 74% for the best model, and downstream evaluations show that augmented agents produce more accurate, complete, and scientifically sound experiments than baselines. The work demonstrates scalable, automated construction of reusable code libraries for Code-RAG agents and provides open-source tooling to advance automated scientific coding.
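The generate-and-debug step described above can be pictured as a simple loop. The following is a minimal sketch under stated assumptions, not the CodeDistiller implementation: the names `run_candidate`, `distill`, and the `generate` callable (standing in for an LLM call) are hypothetical, and a zero exit code is used as a simplified stand-in for the vetting the system actually performs.

```python
import subprocess
import sys
import tempfile
from pathlib import Path
from typing import Callable, Dict, Optional, Tuple


def run_candidate(code: str, timeout_s: int = 60) -> Tuple[bool, str]:
    """Run a candidate example in a fresh interpreter.

    Returns (succeeded, combined stdout and stderr). Success here means
    only "exit code 0", which is an assumption made for this sketch.
    """
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "example.py"
        script.write_text(code)
        try:
            proc = subprocess.run(
                [sys.executable, str(script)],
                capture_output=True,
                text=True,
                timeout=timeout_s,
                cwd=tmp,
            )
        except subprocess.TimeoutExpired:
            return False, "timed out"
        return proc.returncode == 0, proc.stdout + proc.stderr


# Hypothetical signature for the LLM step:
# (repo summary, relevant files, previous code, previous log) -> new code
Generator = Callable[[str, Dict[str, str], Optional[str], Optional[str]], str]


def distill(
    repo_summary: str,
    relevant_files: Dict[str, str],
    generate: Generator,
    max_rounds: int = 3,
) -> Optional[str]:
    """Generate an example, run it, and feed failures back for repair."""
    code: Optional[str] = None
    log: Optional[str] = None
    for _ in range(max_rounds):
        # First round: no previous code or log. Later rounds: repair attempt.
        code = generate(repo_summary, relevant_files, code, log)
        ok, log = run_candidate(code)
        if ok:
            return code  # a working example for the library
    return None  # give up after max_rounds failed attempts
```

In this sketch, a repository counts as successfully distilled only if some round produces code that runs to completion; repositories that exhaust `max_rounds` contribute nothing to the library.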
Abstract
Automated Scientific Discovery (ASD) systems can help automatically generate and run code-based experiments, but their capabilities are limited by the code they can reliably generate from parametric knowledge alone. As a result, current systems either mutate a small number of manually crafted experiment examples or operate solely from parametric knowledge, limiting quality and reach. We introduce CodeDistiller, a system that automatically distills large collections of scientific GitHub repositories into a vetted library of working domain-specific code examples, allowing ASD agents to expand their capabilities without manual effort. Using a combination of automatic and domain-expert evaluation on 250 materials science repositories, we find that the best model is capable of producing functional examples for 74% of repositories, while our downstream evaluation shows that an ASD agent augmented with a CodeDistiller-generated library produces more accurate, complete, and scientifically sound experiments than an agent with only general materials-science code examples.
