CRUST-Bench: A Comprehensive Benchmark for C-to-safe-Rust Transpilation
Anirudh Khatry, Robert Zhang, Jia Pan, Ziteng Wang, Qiaochu Chen, Greg Durrett, Isil Dillig
TL;DR
CRUST-Bench is a repository-scale benchmark for C-to-safe-Rust transpilation. It pairs 100 C projects with explicit safe Rust interfaces and test suites that enforce memory safety and idiomatic Rust patterns. The study evaluates 12 frontier LLMs plus repair and agent-based workflows. It finds that single-shot translation remains challenging, while iterative repair and pipeline approaches yield substantial gains, with the best results approaching one-third to nearly half of tasks passing under certain configurations. Key contributions include the dataset construction process, formal interface-driven validation criteria, and empirical insights into common error modes such as type mismatches and borrowing violations, which point to directions for improving automated migration of legacy C code to memory-safe Rust. The dataset and findings are of practical use to teams migrating large C codebases to Rust, supporting more reliable safety-preserving transpilation and guiding future research in LLM-driven code migration.
Abstract
C-to-Rust transpilation is essential for modernizing legacy C code while enhancing safety and interoperability with modern Rust ecosystems. However, no dataset currently exists for evaluating whether a system can transpile C into safe Rust that passes a set of test cases. We introduce CRUST-Bench, a dataset of 100 C repositories, each paired with manually written interfaces in safe Rust as well as test cases that can be used to validate the correctness of the transpilation. By considering entire repositories rather than isolated functions, CRUST-Bench captures the challenges of translating complex projects with dependencies across multiple files. The Rust interfaces serve as explicit specifications that ensure adherence to idiomatic, memory-safe Rust patterns, while the accompanying test cases enforce functional correctness. We evaluate state-of-the-art large language models (LLMs) on this task and find that generating safe and idiomatic Rust remains challenging for a range of state-of-the-art methods and techniques. We also provide insights into the errors LLMs commonly make when transpiling code from C to safe Rust. The best-performing model, OpenAI o1, solves only 15 tasks in a single-shot setting. Improvements on CRUST-Bench would lead to better transpilation systems that can reason about complex scenarios and help migrate legacy codebases from C to languages like Rust that ensure memory safety. The dataset and code are available at https://github.com/anirudhkhatry/CRUST-bench.
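To make the interface-plus-tests setup concrete, the following is a minimal sketch of what one task might look like. It is an illustration under stated assumptions, not an excerpt from the dataset: the `IntStack` type, its methods, and the C signatures in the comments are all hypothetical, and the actual CRUST-Bench interfaces may be structured differently. The idea is that the signatures are fixed in safe Rust (no raw pointers, no `unsafe`), the transpiler supplies the bodies, and a test checks the behavior.

```rust
// Hypothetical C API being migrated (illustrative only, not from the dataset):
//   typedef struct { int *data; size_t len, cap; } IntStack;
//   void stack_push(IntStack *s, int v);
//   int  stack_pop(IntStack *s, int *out);   /* returns 0 when empty */

/// Safe Rust interface: ownership of the buffer lives in a `Vec`,
/// so there is no manual allocation and no raw pointer.
#[derive(Debug, Default)]
pub struct IntStack {
    data: Vec<i32>,
}

impl IntStack {
    /// Creates an empty stack.
    pub fn new() -> Self {
        IntStack { data: Vec::new() }
    }

    /// Pushes a value; `&mut self` replaces the C `IntStack *` parameter.
    pub fn push(&mut self, value: i32) {
        self.data.push(value);
    }

    /// Pops the top value; `Option` replaces the C out-parameter and
    /// status code, returning `None` when the stack is empty.
    pub fn pop(&mut self) -> Option<i32> {
        self.data.pop()
    }

    /// Number of stored elements.
    pub fn len(&self) -> usize {
        self.data.len()
    }

    /// True when the stack holds no elements.
    pub fn is_empty(&self) -> bool {
        self.data.is_empty()
    }
}

// A test in the style of the accompanying test cases: it only uses the
// public interface, so any implementation with these signatures can be checked.
#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn push_then_pop_is_lifo() {
        let mut s = IntStack::new();
        s.push(1);
        s.push(2);
        assert_eq!(s.pop(), Some(2));
        assert_eq!(s.pop(), Some(1));
        assert_eq!(s.pop(), None);
    }
}
```

In this sketch the interface does the work of a specification: a translation that kept the C-style raw pointer and status code would not match the signatures, and one that matched them but got the ordering wrong would fail the test.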
