LLMigrate: Transforming "Lazy" Large Language Models into Efficient Source Code Migrators

Yuchen Liu, Junhao Hu, Yingdi Shan, Ge Li, Yanzhen Zou, Yihong Dong, Tao Xie

TL;DR

This paper introduces LLMigrate, a function-level C-to-Rust translation framework that mitigates LLM laziness by decomposing large modules into small functions, translating each independently, and reintegrating them under call-graph guidance. The system combines function splitting, context probing, and a repair loop with rule-based support to produce safe, compilable Rust, keeping human edits under 15% of the final code on three Linux kernel modules (math, sort, and ramfs). Function-level translation improves correctness and safety relative to whole-module LLM translation, and the Repair component further increases compilation success. The work takes a pragmatic hybrid approach: LLMs generate idiomatic code, while static analysis and program repair make the migration safe and scalable for large codebases.

Abstract

Rewriting C code in Rust provides stronger memory safety, yet migrating large codebases such as the 32-million-line Linux kernel remains challenging. While rule-based translators (e.g., C2Rust) produce accurate yet largely unsafe Rust programs, recent Large Language Model (LLM) approaches produce more idiomatic, safe Rust programs but frequently exhibit "laziness", omitting significant portions of the target code. To address this issue, we present LLMigrate, an LLM-based C-to-Rust translation tool that splits modules into discrete functions, translates them individually, and then reintegrates them. LLMigrate uses static analysis to retain necessary context, pairs GPT-4o (a state-of-the-art LLM) with compiler-driven translation and program-repair techniques for complex core functions, and leverages call-graph-guided translation to ensure consistent interfaces. Evaluations on three representative Linux kernel modules (math, sort, and ramfs) show that LLMigrate requires modifying less than 15% of the target code, significantly outperforming a pure GPT-4o-based migration.
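
The split, translate, and reintegrate flow described above can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes that "call-graph-guided" means translating callees before their callers so that each caller sees the already translated interfaces of the functions it calls. The names `translation_order`, `migrate`, and the `translate` callback (a stand-in for an LLM call) are hypothetical.

```python
from typing import Callable, Dict, List


def translation_order(call_graph: Dict[str, List[str]]) -> List[str]:
    """Order functions so that callees come before callers (post-order DFS).

    Assumption: callee-first ordering is one plausible reading of
    "call-graph-guided translation"; the paper may order differently.
    Callees that are not keys of call_graph (external symbols) are skipped.
    """
    order: List[str] = []
    seen = set()  # functions that are finished or currently being visited

    def visit(fn: str) -> None:
        if fn in seen:
            return  # already handled, or a recursive cycle: stop here
        seen.add(fn)
        for callee in call_graph[fn]:
            if callee in call_graph:
                visit(callee)
        order.append(fn)

    for fn in sorted(call_graph):  # sorted for a deterministic result
        visit(fn)
    return order


def migrate(
    call_graph: Dict[str, List[str]],
    c_sources: Dict[str, str],
    translate: Callable[[str, Dict[str, str]], str],
) -> Dict[str, str]:
    """Translate one function at a time, passing translated callees as context."""
    translated: Dict[str, str] = {}
    for fn in translation_order(call_graph):
        # Context: Rust code of callees that have already been translated.
        context = {c: translated[c] for c in call_graph[fn] if c in translated}
        translated[fn] = translate(c_sources[fn], context)
    return translated


# Illustrative call graph with made-up function names.
graph = {"a": ["b", "c"], "b": ["c"], "c": []}
sources = {name: f"/* C body of {name} */" for name in graph}


def stub(src: str, ctx: Dict[str, str]) -> str:
    """Placeholder translator that records which callees it was shown."""
    return f"// Rust for {src}; callees seen: {sorted(ctx)}"


result = migrate(graph, sources, stub)
```

Because each call to `translate` receives a single function plus a small amount of context, the input stays short, which is the property the paper relies on to reduce laziness. A repair loop driven by compiler errors would wrap the `translate` call and is omitted here.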

Paper Structure

This paper contains 38 sections, 14 figures, 5 tables.

Figures (14)

  • Figure 1: The positioning of our approach relative to existing work. The x-axis represents the ratio of safe Rust code generated. The y-axis denotes the accuracy, or success rate, of transplanting automatically or semi-automatically translated code. C2Rust serves as a state-of-the-art rule-based translation tool, while GPT-4 represents a state-of-the-art learning-based translation approach.
  • Figure 2: Translated Rust code snippets of the ramfs module, where the LLM omits essential details ("laziness" problem).
  • Figure 3: The distribution of numbers of code lines (# Line) per function across three modules (math, sort, ramfs) in the Linux kernel. The total number of lines of code for all functions is 496 in the math module, 173 in the sort module, and 379 in the ramfs module.
  • Figure 4: The probability of laziness across different numbers of code lines (# Line). As the number of code lines increases, the probability of laziness exhibits a rising trend.
  • Figure 5: System overview.
  • ...and 9 more figures