CRUST-Bench: A Comprehensive Benchmark for C-to-safe-Rust Transpilation

Anirudh Khatry, Robert Zhang, Jia Pan, Ziteng Wang, Qiaochu Chen, Greg Durrett, Isil Dillig

TL;DR

CRUST-Bench presents a repository-scale benchmark for C-to-safe-Rust transpilation, pairing 100 C projects with explicit safe Rust interfaces and test suites to enforce memory safety and idiomatic Rust patterns. The study evaluates 12 frontier LLMs plus repair and agent-based workflows. It finds that single-shot translation remains challenging, while iterative repair and pipeline approaches yield substantial gains: under the best configurations, between roughly one-third and nearly half of the tasks pass. Key contributions include the dataset construction process, interface-driven validation criteria, and empirical insights into common error modes such as type mismatches and borrowing violations, which point to directions for improving automated migration of legacy C code to memory-safe Rust. The dataset and findings are of practical use to teams migrating large C codebases to Rust and can guide future research in LLM-driven code migration.

Abstract

C-to-Rust transpilation is essential for modernizing legacy C code while enhancing safety and interoperability with modern Rust ecosystems. However, no dataset currently exists for evaluating whether a system can transpile C into safe Rust that passes a set of test cases. We introduce CRUST-Bench, a dataset of 100 C repositories, each paired with manually written interfaces in safe Rust as well as test cases that can be used to validate correctness of the transpilation. By considering entire repositories rather than isolated functions, CRUST-Bench captures the challenges of translating complex projects with dependencies across multiple files. The Rust interfaces serve as explicit specifications that ensure adherence to idiomatic, memory-safe Rust patterns, while the accompanying test cases enforce functional correctness. We evaluate state-of-the-art large language models (LLMs) on this task and find that safe and idiomatic Rust generation is still a challenging problem for various state-of-the-art methods and techniques. We also provide insights into the errors LLMs usually make in transpiling code from C to safe Rust. The best-performing model, OpenAI o1, is able to solve only 15 tasks in a single-shot setting. Improvements on CRUST-Bench would lead to improved transpilation systems that can reason about complex scenarios and help in migrating legacy codebases from C into languages like Rust that ensure memory safety. You can find the dataset and code at https://github.com/anirudhkhatry/CRUST-bench.

Paper Structure

This paper contains 12 sections, 6 figures, 6 tables.

Figures (6)

  • Figure 1: Example of a CRUST-Bench task: btree-map. Top: The task specification provided by CRUST-Bench, including the C source code (left), a safe Rust interface (middle), and Rust test cases (right). The C code implements the find_value function, which traverses a B-tree map to locate the value for a given key. This implementation relies heavily on raw pointers (e.g., key). In contrast, the Rust interface uses safe, structured types such as Vec<u8>, requiring the transpiler to generate memory-safe, idiomatic Rust. Bottom right: The expected Rust implementation, representing the actual target of the transpilation task. Bottom left: Additional challenges of the transpilation task are highlighted, illustrating the complexity of translating low-level pointer operations to safe abstractions.
  • Figure 2: Application types.
  • Figure 3: Statistics of pipelined SWE-agent with a cost budget of $2. Left: Distribution of steps taken until submission/exit. We see that a majority of resolved test failures ($\sim$80%) are addressed within the first 50 steps of SWE-agent, showcasing early convergence when failures are recoverable. Right: Distribution of cost required to fix test failures successfully.
  • Figure 4: Statistics of pipelined SWE-agent with a cost budget of $4. Left: Distribution of steps taken until submission/exit. We see that a majority of tasks are addressed within the first 40 steps of SWE-agent, again showcasing early convergence, even with a higher cost, in cases where errors are recoverable. Right: Distribution of cost required to fix tests successfully. Only 1 task takes over $3.5 to be addressed.
  • Figure 5: Analysis of SWE-agent build and test command invocations across different budget levels. We note that the model invokes the build command more as the cost budget is increased. The average number of test invocations remains the same.
  • ...and 1 more figure
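
The contrast described in the Figure 1 caption (raw-pointer C code versus a safe Rust interface built on types such as Vec<u8>) can be illustrated with a minimal sketch. This is not the benchmark's actual btree-map interface or reference solution: the type name `ByteMap`, the `insert` method, and the exact signature of `find_value` are assumptions made for illustration, and a sorted vector stands in for the real B-tree node structure. The point is only to show the shape of a task: a safe signature fixed in advance, an implementation written without `unsafe`, and tests that check behavior.

```rust
/// Hypothetical, simplified stand-in for the map in Figure 1.
/// A sorted Vec of (key, value) pairs replaces the real B-tree nodes;
/// no `unsafe` and no raw pointers are used anywhere.
pub struct ByteMap {
    entries: Vec<(Vec<u8>, Vec<u8>)>, // invariant: sorted by key, keys unique
}

impl ByteMap {
    /// Create an empty map.
    pub fn new() -> Self {
        ByteMap { entries: Vec::new() }
    }

    /// Insert a key/value pair, returning the previous value if the key existed.
    pub fn insert(&mut self, key: Vec<u8>, value: Vec<u8>) -> Option<Vec<u8>> {
        match self
            .entries
            .binary_search_by(|(k, _)| k.as_slice().cmp(key.as_slice()))
        {
            // Key already present: swap in the new value, hand back the old one.
            Ok(i) => Some(std::mem::replace(&mut self.entries[i].1, value)),
            // Key absent: insert at the position that keeps the Vec sorted.
            Err(i) => {
                self.entries.insert(i, (key, value));
                None
            }
        }
    }

    /// Safe counterpart of a pointer-based lookup (assumed signature):
    /// the key is a borrowed slice, and "not found" is `None` rather than NULL.
    pub fn find_value(&self, key: &[u8]) -> Option<&Vec<u8>> {
        self.entries
            .binary_search_by(|(k, _)| k.as_slice().cmp(key))
            .ok()
            .map(|i| &self.entries[i].1)
    }
}
```

In a setup like this, test cases would call `insert` and `find_value` only through the public signatures, so any implementation that compiles against the interface and returns the right values passes, regardless of its internal data structure.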