What a diff makes: automating code migration with large language models

Katherine A. Rosenfeld, Cliff C. Kerr, Jessica Lundin

TL;DR

This work tackles maintaining software compatibility when a dependency undergoes semantic-version changes, using large language models (LLMs) prompted with diff-based representations of the dependency's code changes. It introduces AIMigrate, a Python toolkit that constructs diffs between the legacy and target library versions and prompts an LLM with the pre-update code to generate post-update code, allowing in-context migration with minimal library-specific wiring. The approach is evaluated on three case studies (Typhoidsim, Parcels, LangChain BriefGPT). In the Typhoidsim migration between Starsim versions, AIMigrate correctly identified 65% of the required changes in a single run and 80% with multiple runs, with 47% of changes generated perfectly. Diffs can compress the changes between versions and improve migration performance relative to out-of-the-box LLMs. Performance depends on context and model: diff-informed prompts outperform plain code in some scenarios, while black-box prompts remain competitive in others. Limitations include manual file selection, the large context windows needed for diffs, and the need for human verification; future work points to improved diff construction, scalability, and broader language support.

Abstract

Modern software programs are built on stacks that are often undergoing changes that introduce updates and improvements, but may also break any project that depends upon them. In this paper we explore the use of Large Language Models (LLMs) for code migration, specifically the problem of maintaining compatibility with a dependency as it undergoes major and minor semantic version changes. We demonstrate, using metrics such as test coverage and change comparisons, that contexts containing diffs can significantly improve performance over out-of-the-box LLMs and, in some cases, perform better than contexts containing code. We provide a dataset to assist in further development of this problem area, as well as an open-source Python package, AIMigrate, that can be used to assist with migrating code bases. In a real-world migration of TYPHOIDSIM between STARSIM versions, AIMigrate correctly identified 65% of required changes in a single run, increasing to 80% with multiple runs, with 47% of changes generated perfectly.
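The idea of a diff-based context can be illustrated with a minimal sketch. It is not AIMigrate's implementation: the function names, the prompt wording, and the toy library below are all assumptions made for illustration, and only Python's standard difflib module is used. The sketch builds a unified diff between two versions of a dependency file and places it in a prompt ahead of the project code that needs migrating.

```python
import difflib


def build_diff(old_source: str, new_source: str, filename: str) -> str:
    """Return a unified diff between two versions of one dependency file."""
    diff_lines = difflib.unified_diff(
        old_source.splitlines(keepends=True),
        new_source.splitlines(keepends=True),
        fromfile=f"a/{filename}",
        tofile=f"b/{filename}",
    )
    return "".join(diff_lines)


def build_migration_prompt(library_diff: str, project_code: str) -> str:
    """Assemble a prompt from the dependency diff and the pre-update code.

    The wording is hypothetical; a real tool would tune it per model.
    """
    return (
        "The dependency changed as shown in this diff:\n\n"
        f"{library_diff}\n"
        "Update the following code so it works with the new version. "
        "Return only the updated code.\n\n"
        f"{project_code}"
    )


# Toy dependency (hypothetical): a keyword argument is renamed.
OLD_LIB = "def run(sim, n_steps):\n    return sim.step(n_steps)\n"
NEW_LIB = "def run(sim, dur):\n    return sim.step(dur)\n"

# Pre-update project code that still uses the old keyword.
PROJECT = "result = run(sim, n_steps=10)\n"

library_diff = build_diff(OLD_LIB, NEW_LIB, "lib/core.py")
prompt = build_migration_prompt(library_diff, PROJECT)
print(prompt)
```

The resulting prompt string would then be sent to whichever LLM is being evaluated. The diff carries only the changed lines plus a little surrounding context, which is why it can be much smaller than the full source of either library version.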

Paper Structure

This paper contains 26 sections, 9 figures.

Figures (9)

  • Figure 1: Sizes of the Starsim, Parcels, and LangChain repositories over the course of their commit history (gray). The size of the commit differences is plotted in green. We measure size in tokens using the o200k_base tokenizer, counting only Python files (.py) and excluding the docs/ and tests/ directories.
  • Figure 2: Results from the diff comprehension test. The figure plots the LLM output using the code (blue) versus diffs (orange) as a function of the correct answer (the number of functions with an error). Each sub-panel corresponds to a different LLM ($n=50$ test questions per LLM and method).
  • Figure 3: Diagram showing how AIMigrate uses diffs to automate code migration. The diagram does not capture how the pre-update project files are filtered for migration, nor does it capture the quality and safety checks applied to the resulting code. The migration loop can be run multiple times to increase the overall quality of the results (see the section on correctness). Figure made with app.diagrams.net.
  • Figure 4: Examples of the case studies: time series from a Typhoidsim simulation (left), a comparison of advection kernels in a Parcels tutorial (center), and Q&A in the BriefGPT app (right).
  • Figure 5: Heatmaps showing the mean number of tests passing by case study for each model and migration method combination.
  • ...and 4 more figures
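The file filtering described in the Figure 1 caption can be sketched as follows. This is an illustration under stated assumptions, not the authors' measurement script: the function names are hypothetical, docs/ and tests/ are excluded at any depth (the caption does not say whether only top-level directories are meant), and a whitespace token count stands in for the o200k_base tokenizer so that the sketch runs with the standard library alone.

```python
from pathlib import Path
from typing import Callable

# Directory names to skip, following the Figure 1 caption.
EXCLUDED_DIRS = {"docs", "tests"}


def count_whitespace_tokens(text: str) -> int:
    """Stand-in token counter; not equivalent to o200k_base."""
    return len(text.split())


def repo_size(
    root: str,
    count_tokens: Callable[[str], int] = count_whitespace_tokens,
) -> int:
    """Sum token counts over .py files outside docs/ and tests/."""
    root_path = Path(root)
    total = 0
    for path in sorted(root_path.rglob("*.py")):
        # Inspect only the directories between the root and the file.
        parent_dirs = path.relative_to(root_path).parts[:-1]
        if EXCLUDED_DIRS.intersection(parent_dirs):
            continue
        text = path.read_text(encoding="utf-8", errors="ignore")
        total += count_tokens(text)
    return total
```

To approximate the paper's measurement more closely, a counter built on the tiktoken package (for example, the length of `tiktoken.get_encoding("o200k_base").encode(text)`) could be passed as `count_tokens`; that dependency is left out here to keep the sketch self-contained.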