CodeDistiller: Automatically Generating Code Libraries for Scientific Coding Agents

Peter Jansen, Samiah Hassan, Pragnya Narasimha

TL;DR

Automated scientific discovery agents rely too heavily on parametric knowledge. CodeDistiller addresses this limitation by automatically distilling large collections of GitHub repositories into vetted, working domain-specific code examples. The system identifies each repository's purpose, selects the relevant files, and uses LLMs to generate and iteratively debug executable code. It is tested in a materials-science setting with multiple base models. Across 250 repositories, distillation success reaches up to 74% with the best model, and downstream evaluations show that augmented agents produce more accurate, complete, and scientifically sound experiments than baselines. The work demonstrates scalable, automated construction of reusable code libraries for Code-RAG agents and provides open-source tooling to advance automated scientific coding.

Abstract

Automated Scientific Discovery (ASD) systems can help automatically generate and run code-based experiments, but their capabilities are limited by the code they can reliably generate from parametric knowledge alone. As a result, current systems either mutate a small number of manually crafted experiment examples or operate solely from parametric knowledge, limiting quality and reach. We introduce CodeDistiller, a system that automatically distills large collections of scientific GitHub repositories into a vetted library of working domain-specific code examples, allowing ASD agents to expand their capabilities without manual effort. Using a combination of automatic and domain-expert evaluation on 250 materials science repositories, we find the best model is capable of producing functional examples for 74% of repositories, while our downstream evaluation shows an ASD agent augmented with a CodeDistiller-generated library produces more accurate, complete, and scientifically sound experiments than an agent with only general materials-science code examples.

Paper Structure

This paper contains 15 sections, 3 figures, 2 tables.

Figures (3)

  • Figure 1: CodeDistiller distills a large collection of GitHub repositories into a library of reusable scientific code, allowing Code-RAG style scientific discovery agents to perform tasks beyond their parametric knowledge.
  • Figure 2: An overview of the core stages of the CodeDistiller workflow: identifying the core purpose of the repository, selecting the files relevant for building an example, and generating and debugging the example.
  • Figure 3: Results of A/B testing, showing the proportion of times the judge model preferred the experimental output from the baseline model (with generic materials science code examples) versus the model augmented with a CodeDistiller-generated library. Values represent the average of 50 experimental tasks implemented using CodeScientist.
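
The generate-and-debug stage described for Figure 2 can be pictured as a bounded retry loop. The following is a minimal sketch only, not the authors' implementation: the names `distill_example`, `DistillResult`, `generate`, `run`, and `max_attempts` are all hypothetical, and the LLM call and sandboxed execution are replaced by plain callables.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class DistillResult:
    code: Optional[str]  # last working candidate, or None if none succeeded
    attempts: int        # number of generate/run cycles performed
    success: bool


def distill_example(
    generate: Callable[[str, Optional[str]], str],
    run: Callable[[str], Optional[str]],
    context: str,
    max_attempts: int = 3,
) -> DistillResult:
    """Hypothetical loop: generate code, run it, feed any error back.

    generate(context, last_error) returns candidate code (assumed LLM call).
    run(code) returns None on success or an error message (assumed sandbox).
    """
    error: Optional[str] = None
    for attempt in range(1, max_attempts + 1):
        code = generate(context, error)
        error = run(code)
        if error is None:
            return DistillResult(code, attempt, True)
    return DistillResult(None, max_attempts, False)


# Stand-ins for the LLM and the execution environment (assumptions).
def fake_generate(context: str, last_error: Optional[str]) -> str:
    # The first draft is buggy; the "debugged" draft appears once an error is seen.
    return "result = 1 + 1" if last_error else "raise ValueError('bug')"


def fake_run(code: str) -> Optional[str]:
    try:
        exec(code, {})
        return None
    except Exception as exc:
        return repr(exc)


result = distill_example(fake_generate, fake_run, "repository summary and selected files")
```

Here the stub produces a failing first draft and a working second draft, so the loop stops after two attempts. A real system would need an isolated execution environment rather than `exec`.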