RustRepoTrans: Repository-level Code Translation Benchmark Targeting Rust

Guangsheng Ou, Mingwei Liu, Yuxuan Chen, Yanlin Wang, Xin Peng, Zibin Zheng

TL;DR

RustRepoTrans is the first repository-level context benchmark for incremental code translation targeting Rust, bridging the gap between function-level and full-repository translation. The authors construct a 375-task dataset with dependency and test context, built from real-world projects, and evaluate seven LLMs, revealing substantial difficulty in handling repository-level dependencies and architectural differences. Even the strongest model, DeepSeek-R1, achieves only 51.5% Pass@1; self-debugging improves performance but leaves notable gaps, and compilation errors are dominated by dependency-resolution issues. The work also provides fine-grained metrics and an enhanced evaluation framework to diagnose failures and guide future improvements in repository-aware code translation and Rust code generation.

Abstract

Recent advancements in large language models (LLMs) have demonstrated impressive capabilities in code translation, typically evaluated using benchmarks like CodeTransOcean and RepoTransBench. However, dependency-free benchmarks fail to capture real-world complexities by focusing primarily on simple function-level translations and overlooking repository-level context (e.g., dependencies). Full-repository translation benchmarks significantly exceed the current capabilities of existing models, resulting in performance bottlenecks that fail to provide actionable insights for guiding model development. Furthermore, existing benchmarks do not account for the scenario of incrementally translating new or modified modules from the source to the target language, which demands careful handling of repository-level contexts such as dependencies, cross-module references, and architectural divergence. Moreover, LLMs' effectiveness in translating to newer, low-resource languages like Rust remains largely underexplored. To address these gaps, we introduce RustRepoTrans, the first repository-level context code translation benchmark targeting incremental translation, comprising 375 tasks translating into Rust from C, Java, and Python. Using this benchmark, we evaluate seven representative LLMs, analyzing their errors to assess limitations in complex translation scenarios. Among them, DeepSeek-R1 performs best with 51.5% Pass@1, excelling in both basic functionality and additional translation abilities, such as noise robustness and syntactical difference identification. However, even DeepSeek-R1 experiences a 22.2-percentage-point performance drop (Pass@1 from 73.7% to 51.5%) when handling repository-level context compared to previous benchmarks without such context.
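
The drop reported in the abstract is the difference between two Pass@1 scores (73.7% and 51.5%), which is a gap in percentage points; measured relative to the starting score, the decline is larger. The following is a minimal sketch of that arithmetic. The helper names `absolute_drop` and `relative_drop` are hypothetical and are not part of the benchmark's tooling.

```python
def absolute_drop(before: float, after: float) -> float:
    """Difference between two scores, in percentage points."""
    return before - after


def relative_drop(before: float, after: float) -> float:
    """Decline expressed as a percentage of the starting score."""
    return (before - after) / before * 100.0


# Pass@1 values quoted in the abstract for DeepSeek-R1.
without_context = 73.7  # previous benchmarks without repository-level context
with_context = 51.5     # RustRepoTrans

print(f"absolute drop: {absolute_drop(without_context, with_context):.1f} points")
print(f"relative drop: {relative_drop(without_context, with_context):.1f}%")
```

Under these figures the absolute gap is 22.2 points, while the relative decline is roughly 30% of the original score.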

Paper Structure

This paper contains 29 sections, 12 figures, 5 tables.

Figures (12)

  • Figure 1: Motivation Examples from RustRepoTrans of Different Architecture Between Source Language (Python) And Target Language (Rust) Version
  • Figure 2: RustRepoTrans Format Example
  • Figure 3: Prompt for Identifying Equivalent Function Pairs
  • Figure 4: Translating Prompt and Debugging Prompt
  • Figure 5: Pass@1, DSR@1 (self-debugging) on RustRepoTrans
  • ...and 7 more figures