C2RUST-BENCH: A Minimized, Representative Dataset for C-to-Rust Transpilation Evaluation

Melih Sirlanci, Carter Yagemann, Zhiqiang Lin

TL;DR

C2RUST-BENCH is a minimized, representative benchmark for evaluating C-to-Rust transpilation, built by selecting 2,905 functions from a 15,503-function real-world corpus with a metric-driven, partitioned sampling approach. The selection method combines Maintainability Index, unsafe code complexity, and data-type complexity with PCA-based scoring to ensure diverse coverage while reducing evaluation time. The approach is validated through cross-LLM experiments and hyperparameter tuning, followed by a final dataset release, and it substantially reduces both codebase size and computational cost (81.3% fewer functions; ~78.9% time savings). The work provides a standardized, reusable resource for fair comparison and rapid iteration of transpilation tools, including LLM-based systems, and highlights practical implications for migrating software to memory-safe languages.
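To make the idea of PCA-based scoring followed by partitioned sampling concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the synthetic metric values, the use of the first principal component as the score, the equal-size partitions, the fixed per-partition fraction, and every function name (`standardize`, `first_component`, `select`) are assumptions made for illustration only.

```python
import random
from statistics import mean, pstdev


def standardize(rows):
    """Z-score each metric column so no single metric dominates."""
    cols = list(zip(*rows))
    mus = [mean(c) for c in cols]
    sds = [pstdev(c) or 1.0 for c in cols]  # guard against constant columns
    return [[(v - m) / s for v, m, s in zip(r, mus, sds)] for r in rows]


def first_component(rows, iters=200):
    """Approximate the top principal axis by power iteration on the covariance."""
    n, d = len(rows), len(rows[0])
    cov = [[sum(r[i] * r[j] for r in rows) / n for j in range(d)] for i in range(d)]
    v = [1.0] * d
    for _ in range(iters):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = sum(x * x for x in w) ** 0.5 or 1.0
        v = [x / norm for x in w]
    return v


def select(metrics, n_partitions=5, fraction=0.2, seed=0):
    """Score functions, split the score ordering into partitions, sample each."""
    z = standardize(metrics)
    axis = first_component(z)
    scores = [sum(a * x for a, x in zip(axis, r)) for r in z]
    order = sorted(range(len(metrics)), key=scores.__getitem__)
    rng = random.Random(seed)
    size = -(-len(order) // n_partitions)  # ceiling division
    chosen = []
    for start in range(0, len(order), size):
        part = order[start:start + size]
        k = max(1, round(fraction * len(part)))
        chosen.extend(rng.sample(part, k))
    return sorted(chosen)


# Hypothetical per-function metrics (synthetic, for illustration only):
# (maintainability index, unsafe code complexity, data-type complexity)
gen = random.Random(1)
metrics = [(gen.uniform(0, 100), gen.randint(0, 20), gen.randint(0, 10))
           for _ in range(100)]
selected = select(metrics)
print(len(selected), "of", len(metrics), "functions selected")
```

Sampling inside each partition, rather than from the whole pool at once, is what keeps low-, mid-, and high-scoring functions all represented in the reduced set; the number of partitions and the sampling fraction play the role of tunable hyperparameters in this sketch.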

Abstract

Despite two decades of effort in vulnerability detection, memory-safety vulnerabilities remain a critical problem. Recent reports suggest that the key solution is to migrate to memory-safe languages. To this end, C-to-Rust transpilation has become a popular way to resolve memory-safety issues in C programs. Recent works propose C-to-Rust transpilation frameworks; however, a comprehensive evaluation dataset is missing. One solution is to assemble a sufficiently large dataset, but this increases analysis time for automated frameworks and, in some cases, for manual effort. In this work, we build a method that selects functions from a large set to construct a minimized yet representative dataset for evaluating C-to-Rust transpilation. We propose C2RUST-BENCH, which contains 2,905 functions that are representative of C-to-Rust transpilation, selected from 15,503 functions of real-world programs.
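The 81.3% function reduction quoted in the TL;DR follows directly from the two function counts given here. A short check, using only those two numbers (the variable names are illustrative):

```python
total_functions = 15_503     # functions in the full real-world corpus
selected_functions = 2_905   # functions kept in C2RUST-BENCH

# Share of functions removed by the selection, as a percentage
reduction_pct = round(100 * (1 - selected_functions / total_functions), 1)
print(f"{reduction_pct}% fewer functions")
```

The ~78.9% time saving cannot be derived the same way, since it depends on measured per-function evaluation times rather than on the counts alone.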

Paper Structure

This paper contains 45 sections, 3 equations, 5 figures, 3 tables.

Figures (5)

  • Figure 1: The change of relative difference over combinations of values of two hyperparameters.
  • Figure 2: The relative difference for 9 LLMs.
  • Figure 3: Distribution of compilation-error fixing attempts for the selected and microbenchmark sets across 9 LLMs.
  • Figure 4: The instructions given to the LLM for initial transpilation.
  • Figure 5: The instructions given to the LLM for fixing compilation errors.