C2RUST-BENCH: A Minimized, Representative Dataset for C-to-Rust Transpilation Evaluation
Melih Sirlanci, Carter Yagemann, Zhiqiang Lin
TL;DR
C2RUST-BENCH is a minimized, representative benchmark for evaluating C-to-Rust transpilation. It selects 2,905 functions from a 15,503-function real-world corpus using a metric-driven, partitioned sampling approach. The selection combines Maintainability Index, unsafe-code complexity, and data-type complexity with PCA-based scoring to ensure diverse coverage while reducing evaluation time. The approach is validated through cross-LLM experiments and hyperparameter tuning, and the final dataset is released; it cuts the function count by 81.3% and evaluation time by roughly 78.9%. The work provides a standardized, reusable resource for fair comparison and rapid iteration of transpilation tools, including LLM-based systems, and highlights practical implications for migrating software to memory-safe languages.
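As a rough illustration of the partitioned sampling idea, the sketch below bins functions by a single precomputed score and samples a fixed fraction from each bin. It is a minimal sketch under stated assumptions, not the authors' selection procedure: the names `partitioned_sample`, `n_partitions`, and `fraction`, the equal-width binning, and the placeholder scores are all hypothetical, and the PCA step that would combine the three metrics into one score is not shown. The helper `reduction_pct` only re-derives the reported function reduction from the two counts stated above.

```python
import random


def partitioned_sample(scores, n_partitions, fraction, seed=0):
    """Sample about `fraction` of the items from each equal-width score bin.

    `scores` maps a function name to one numeric score. In the paper the
    score is PCA-based; here it is just a placeholder number (assumption).
    """
    rng = random.Random(seed)
    lo, hi = min(scores.values()), max(scores.values())
    width = (hi - lo) / n_partitions
    if width == 0:
        width = 1.0  # all scores equal: everything lands in bin 0

    bins = [[] for _ in range(n_partitions)]
    for name, score in sorted(scores.items()):
        # Clamp so the maximum score falls in the last bin.
        idx = min(int((score - lo) / width), n_partitions - 1)
        bins[idx].append(name)

    selected = []
    for members in bins:
        if not members:
            continue
        # Keep at least one item so every non-empty bin stays represented.
        k = max(1, round(fraction * len(members)))
        selected.extend(rng.sample(members, k))
    return selected


def reduction_pct(selected_count, total_count):
    """Percentage of items removed when keeping `selected_count` of `total_count`."""
    return 100.0 * (1.0 - selected_count / total_count)


# Placeholder corpus: 100 hypothetical functions with made-up scores.
demo_scores = {f"fn_{i}": float(i) for i in range(100)}
demo_selection = partitioned_sample(demo_scores, n_partitions=5, fraction=0.2)
```

Sampling within bins rather than over the whole corpus keeps sparsely populated score ranges in the subset, which is the property a representative benchmark needs; the bin count and sampling fraction here are arbitrary choices for the demo.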
Abstract
Despite two decades of effort in vulnerability detection, memory safety vulnerabilities remain a critical problem. Recent reports suggest that the key solution is to migrate to memory-safe languages. To this end, C-to-Rust transpilation has become a popular way to resolve memory-safety issues in C programs. Recent works propose C-to-Rust transpilation frameworks; however, a comprehensive evaluation dataset is missing. Although one solution is to assemble a sufficiently large dataset, doing so increases analysis time for automated frameworks and, in some cases, for manual effort. In this work, we build a method that selects functions from a large set to construct a minimized yet representative dataset for evaluating C-to-Rust transpilation. We propose C2RUST-BENCH, which contains 2,905 functions that are representative of C-to-Rust transpilation, selected from 15,503 functions of real-world programs.
