Skeleton-Guided-Translation: A Benchmarking Framework for Code Repository Translation with Fine-Grained Quality Evaluation

Xing Zhang, Jiaheng Wen, Fangkai Yang, Pu Zhao, Yu Kang, Junhao Wang, Maoquan Wang, Yufan Huang, Elsie Nallipogu, Qingwei Lin, Yingnong Dang, Saravan Rajmohan, Dongmei Zhang, Qi Zhang

TL;DR

This work tackles repository-level code translation, an area underserved by existing benchmarks that mainly address function-level tasks. It introduces Skeleton-Guided-Translation, a two-step process that first translates repository skeletons and then fills them to ensure coherent inter-file interfaces, and TRANSREPO-BENCH, a 13-task Java-to-C# benchmark with fixed unit tests and testing configurations for automated evaluation. A fine-grained evaluation framework scores translation quality at the unit-test level by executing tests within skeleton-driven environments, enabling partial credit and clearer debugging signals. Experiments across multiple LLMs reveal that skeleton-based translation improves dependency handling and incremental validation, with DeepSeek-v3 achieving the best build rate among the tested models while still facing substantive functional challenges. Overall, the framework advances maintainability, incrementality, and actionable assessment for repository-scale code translation in real-world software modernization.
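The "skeleton" idea can be made concrete with a small sketch. It assumes a skeleton keeps class and method declarations and replaces method bodies with a stub; the summary above does not spell out the exact construction, so the function name `java_skeleton`, the stub text, and the brace-counting approach are all illustrative assumptions rather than the authors' tooling.

```python
def java_skeleton(source: str,
                  stub: str = " throw new UnsupportedOperationException(); ") -> str:
    """Naively reduce a Java class to declarations with stubbed method bodies.

    Hypothetical helper for illustration only. It counts braces, so it
    mishandles braces inside strings or comments, nested classes, and
    array initializers; a real tool would use a proper parser.
    """
    out = []
    depth = 0  # 0 = outside the class, 1 = class body, >= 2 = method body
    for ch in source:
        if ch == "{":
            depth += 1
            if depth == 2:
                # Entering a method body: keep the brace, emit the stub once.
                out.append("{" + stub)
                continue
            if depth > 2:
                continue  # nested block inside a method body: drop it
        elif ch == "}":
            depth -= 1
            if depth >= 2:
                continue  # closing a nested block inside a method body
        elif depth >= 2:
            continue  # ordinary character inside a method body: drop it
        out.append(ch)
    return "".join(out)


if __name__ == "__main__":
    java = "class A { int f(int x) { if (x > 0) { return 1; } return 0; } }"
    print(java_skeleton(java))
```

Under this assumption, translating the stubbed declarations first fixes the cross-file interfaces, and each body can then be translated and checked against them separately.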

Abstract

The advancement of large language models has intensified the need to modernize enterprise applications and migrate legacy systems to secure, versatile languages. However, existing code translation benchmarks primarily focus on individual functions, overlooking the complexities involved in translating entire repositories, such as maintaining inter-module coherence and managing dependencies. While some recent repository-level translation benchmarks attempt to address these challenges, they still face limitations, including poor maintainability and overly coarse evaluation granularity, which make them less developer-friendly. We introduce Skeleton-Guided-Translation, a framework for repository-level Java-to-C# code translation with fine-grained quality evaluation. It uses a two-step process: first translating the repository's structural "skeletons", then translating the full repository guided by these skeletons. Building on this, we present TRANSREPO-BENCH, a benchmark of high-quality open-source Java repositories and their corresponding C# skeletons, including matching unit tests and build configurations. Our unit tests are fixed and can be applied across multiple or incremental translations without manual adjustments, enhancing automation and scalability in evaluations. Additionally, we develop fine-grained evaluation metrics that assess translation quality at the individual test case level, addressing a weakness of traditional binary metrics: when a build failure causes every test to fail, they cannot distinguish one translation from another. Evaluations using TRANSREPO-BENCH highlight key challenges and advance more accurate repository-level code translation.
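The contrast between a binary metric and test-level scoring can be sketched as follows. This is a minimal illustration, not the paper's metric definition: the `UnitTestOutcome` record, both function names, and the choice to score the fraction of tests that build and pass are assumptions made here for clarity.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class UnitTestOutcome:
    """Hypothetical record of one fixed unit test run against a translation."""
    name: str
    built: bool   # the test compiled together with the code it exercises
    passed: bool  # the test passed when executed


def binary_score(outcomes: List[UnitTestOutcome]) -> float:
    """Coarse metric: 1.0 only if every test builds and passes, else 0.0."""
    if not outcomes:
        return 0.0
    return 1.0 if all(o.built and o.passed for o in outcomes) else 0.0


def fine_grained_score(outcomes: List[UnitTestOutcome]) -> float:
    """Test-level metric (assumed form): fraction of tests that build and pass."""
    if not outcomes:
        return 0.0
    return sum(1 for o in outcomes if o.built and o.passed) / len(outcomes)


if __name__ == "__main__":
    outcomes = [
        UnitTestOutcome("testParse", built=True, passed=True),
        UnitTestOutcome("testFormat", built=True, passed=True),
        UnitTestOutcome("testRoundTrip", built=True, passed=True),
        UnitTestOutcome("testEdgeCase", built=False, passed=False),
    ]
    print(binary_score(outcomes), fine_grained_score(outcomes))
```

With one of four tests failing to build, the binary score collapses to zero while the test-level score still credits the three working tests, which is the kind of partial-credit signal the abstract describes.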

Paper Structure

This paper contains 24 sections, 11 figures.

Figures (11)

  • Figure 1: A more detailed quality evaluation of translated repositories is needed.
  • Figure 2: Input of Translation Task.
  • Figure 3: Framework of Our Evaluator.
  • Figure 4: Resulting Benchmark.
  • Figure 5: Framework of the Benchmark Construction.
  • ...and 6 more figures