Refactoring Codebases through Library Design

Ziga Kovacic, Justin T. Chiu, Celine Lee, Wenting Zhao, Kevin Ellis

TL;DR

This paper tackles refactoring at scale by reframing it as library design to promote reusability and maintainability, especially as code generation agents tackle broader tasks. It introduces MiniCode, a diverse benchmark with open‑ended design, verifiable evaluation, and multi‑file context, and Librarian, a sample‑and‑rerank method that uses clustering to manage large codebases and MDL‑based ranking to produce reusable libraries. Across synthetic and real‑world codebases (CodeContests, Transformers, Diffusers), MDL consistently aligns with human preferences and yields more reusable abstractions than traditional metrics, while enabling libraries to transfer to unseen tasks. The work demonstrates practical impact by compressing and reorganizing HuggingFace libraries, suggesting a path toward scalable, reusable software design driven by MDL‑guided refactoring. Limitations include dependence on synthetic benchmarks and room for improvement in cross‑cluster reuse dynamics, suggesting future reinforcement learning extensions to further automate library synthesis.

Abstract

Maintainable and general software allows developers to build robust applications efficiently, yet achieving these qualities often requires refactoring specialized solutions into reusable components. This challenge becomes particularly relevant as code agents are increasingly used to solve isolated, one-off programming problems. We investigate code agents' capacity to refactor code in ways that support growth and reusability. We first investigate what makes a good refactoring, finding via simulation results and a human study that Minimum Description Length (MDL) best correlates with preferable refactorings. We then present both a benchmark and a method for refactoring: MiniCode, a benchmark where multiple files must be refactored into a shared library, and Librarian, a sample-and-rerank method for generating reusable libraries. We compare Librarian to state-of-the-art library generation methods, and study it on real-world codebases.
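The sample-and-rerank idea mentioned above can be illustrated with a minimal sketch. It is not the authors' implementation: the `Candidate` structure, the `passes_tests` callback, and the `rerank` function are hypothetical names, and the zlib-compressed size is a swapped-in proxy for description length, since this section does not specify how MDL is computed. The clustering step is omitted entirely.

```python
import zlib
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass
class Candidate:
    """One sampled refactoring (hypothetical structure): a shared library
    plus the rewritten source files that import from it."""
    library: str
    programs: Sequence[str]


def description_length(candidate: Candidate) -> int:
    """Proxy score: zlib-compressed size, in bytes, of library + programs.

    ASSUMPTION: this stands in for the paper's MDL objective, whose exact
    estimator is not given in this section.
    """
    text = "\n".join([candidate.library, *candidate.programs])
    return len(zlib.compress(text.encode("utf-8"), 9))


def rerank(
    candidates: Sequence[Candidate],
    passes_tests: Callable[[Candidate], bool],
    score: Callable[[Candidate], float] = description_length,
) -> Optional[Candidate]:
    """Keep only candidates that preserve correctness, then return the one
    with the lowest score. Returns None if no candidate passes."""
    valid = [c for c in candidates if passes_tests(c)]
    return min(valid, key=score, default=None)
```

In use, `candidates` would come from repeated sampling of a code model on the same set of files, and `passes_tests` would run the original test suite against each rewritten file. Filtering before scoring reflects the requirement that a refactoring keep the pass rate of the original sources.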

Paper Structure

This paper contains 42 sections, 13 equations, 8 figures, 5 tables, 1 algorithm.

Figures (8)

  • Figure 1: Overview of the refactoring problem. A refactoring task comprises a set of files. We refactor the files by designing a new library. Candidate refactorings are evaluated based on a refactoring metric, and are expected to maintain correctness of the original code sources (pass rate). We explore several refactoring metrics in this paper.
  • Figure 2: (A) Asymptotic behavior of metrics for scoring libraries and refactorings (columns) varying refactoring budget (horizontal axes). (B) Comparing metrics via proxies of downstream library quality (total library usage and average calls per library function), for which MDL > Tokens > MI. All results are estimated using Best@k. See also the asymptotics figure in the appendix.
  • Figure 3: Best@K MDL ratio. Increasing sample budget improves MDL on Transformers.
  • Figure 4: Human evaluation of different refactoring objectives. Judges compare pairs of refactorings that both pass all test cases. MDL aligns best with human preferences.
  • Figure 5: Example where tokens and MDL diverge: Obfuscating the original library definitions (left) by shortening variable names (right) reduces tokens but increases MDL.
  • ...and 3 more figures