Translating Large-Scale C Repositories to Idiomatic Rust

Saman Dehghan, Tianran Sun, Tianxiang Wu, Zihan Li, Reyhaneh Jabbarvand

TL;DR

The paper addresses the challenge of translating large-scale C repositories to idiomatic, safe Rust without incurring prohibitive costs or sacrificing correctness. It introduces Rustine, a five-stage, repository-level translation pipeline that combines automated preprocessing, targeted refactoring (including pointer arithmetic handling and constness maximization), dependence-graph analysis, LLM-driven translation with adaptive in-context learning, and validation with automated debugging. The approach yields fully compilable Rust translations for 23 real-world C programs, with 87% functional equivalence (measured by passing test assertions) and improvements in safety, readability, and idiomaticity, at an average cost of $0.48 and an average runtime of 3.92 hours per project. Where translations do not pass all tests, human developers completed the remaining semantic fixes in 4.5 hours on average, using Rustine as debugging support. The work highlights the importance of addressing pointer arithmetic explicitly, provides a benchmark and metrics for future C-to-Rust translation research, and outlines extending correctness to concurrent implementations as future work. Overall, Rustine demonstrates a scalable, cost-effective path toward practical C-to-Rust migration.

Abstract

Existing C to Rust translation techniques fail to balance quality and scalability: transpilation-based approaches scale to large projects but produce code with poor safety, idiomaticity, and readability. In contrast, LLM-based techniques are prohibitively expensive due to their reliance on frontier models (without which they cannot reliably generate compilable translations), thus limiting scalability. This paper proposes Rustine, a fully automated pipeline for effective and efficient repository-level C to idiomatic safe Rust translation. Evaluating on a diverse set of 23 C programs, ranging from 27 to 13,200 lines of code, Rustine can generate fully compilable Rust code for all and achieve 87% functional equivalence (passing 1,063,099 assertions out of 1,221,192 in test suites with average function and line coverage of 74.7% and 72.2%). Compared to six prior repository-level C to Rust translation techniques, the translations by Rustine are overall safer (fewer raw pointers, pointer arithmetic, and unsafe constructs), more idiomatic (fewer Rust linter violations), and more readable. When the translations cannot pass all tests to fulfill functional equivalence, human developers were able to complete the task in 4.5 hours, on average, using Rustine as debugging support.
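To make the safety comparison concrete, the sketch below contrasts a transpiler-style translation that keeps C pointer arithmetic with a slice-based version of the same loop. It is an illustration only, not taken from the paper or its benchmarks; the function names `sum_raw` and `sum_slice` are hypothetical.

```rust
/// Transpiler-style translation: mirrors C pointer arithmetic one-to-one.
/// (Hypothetical example, not output of any tool discussed here.)
///
/// # Safety
/// `data` must point to at least `len` readable, initialized `i32` values.
unsafe fn sum_raw(data: *const i32, len: usize) -> i32 {
    let mut total = 0;
    let mut p = data;
    // One-past-the-end pointer, as in `int *end = data + len;`
    let end = unsafe { data.add(len) };
    while p != end {
        // Dereference and advance, as in `total += *p++;`
        total += unsafe { *p };
        p = unsafe { p.add(1) };
    }
    total
}

/// Idiomatic translation: the (pointer, length) pair becomes a slice,
/// so bounds are tracked by the type and no `unsafe` is needed.
fn sum_slice(data: &[i32]) -> i32 {
    data.iter().sum()
}
```

Both functions compute the same result, but only the first needs `unsafe`, a raw pointer, and pointer arithmetic, which are the constructs the abstract counts when comparing safety across translation techniques.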

Paper Structure

This paper contains 33 sections, 17 figures, 3 tables, 3 algorithms.

Figures (17)

  • Figure 1: Rustine framework consisting of five main components
  • Figure 2: An illustrative example from zopfli project (a), refactored version with no pointer arithmetic (b), and corresponding translations of them (c) and (d)
  • Figure 3: Refactoring of the genann project for resolving unary (middle) and pointer arithmetic (right) operations
  • Figure 4: DG snapshot of the refactoring example figure (fig:refactoring)
  • Figure 5: An example of in-context learning from tulpindicator project. (a) buggy translation; (b) error message; (c, d) ICL example: buggy code; (e) ICL example: fixed code; (f) compilable Rust translation
  • ...and 12 more figures